Posted to issues@spark.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2018/07/31 21:03:00 UTC
[jira] [Comment Edited] (SPARK-24882) data source v2 API improvement

    [ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564362#comment-16564362 ] 

Ryan Blue edited comment on SPARK-24882 at 7/31/18 9:02 PM:
------------------------------------------------------------

{quote}the problem is then we need to make `CatalogSupport` a must-have for data sources instead of an optional plugin
{quote}
Data sources are read and write implementations. Catalog support should be a layer above the read/write implementation, used to provide CTAS and other table-level support.

If you're interested in the anonymous table use case from the email discussion, I posted a suggestion there to add an {{anonymousTable}} function to {{DataSourceV2}}. That allows a source instantiated directly through v1-style reflection to provide a {{Table}} based on an options map. Then that table would implement {{ReadSupport}} and {{WriteSupport}} as I've suggested in this thread. That would preserve the ability to instantiate a source directly and use it, and would center around a {{Table}} that implements the read and write traits.
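
A minimal sketch of the {{anonymousTable}} idea follows. Every interface here is a simplified, hypothetical stand-in and not the actual Spark API: the method signature, the {{options()}} accessor, and the {{PathTable}}, {{PathSource}}, and {{AnonymousTableDemo}} names are assumptions made for illustration only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins. These are NOT the real Spark interfaces.
interface ReadSupport {
    String createReader();   // placeholder for a real reader type
}

interface WriteSupport {
    String createWriter();   // placeholder for a real writer type
}

interface Table {
    Map<String, String> options();
}

interface DataSourceV2 {
    // Assumed shape of the suggested hook: build a Table from an options map,
    // with no catalog involved.
    Table anonymousTable(Map<String, String> options);
}

// The Table, not the source, implements the read and write traits.
final class PathTable implements Table, ReadSupport, WriteSupport {
    private final Map<String, String> options;

    PathTable(Map<String, String> options) {
        // Defensive copy so the table's configuration cannot change later.
        this.options = Collections.unmodifiableMap(new HashMap<>(options));
    }

    @Override
    public Map<String, String> options() { return options; }

    @Override
    public String createReader() { return "reader for " + options.get("path"); }

    @Override
    public String createWriter() { return "writer for " + options.get("path"); }
}

// The source needs only a no-arg constructor so it can be created by reflection.
final class PathSource implements DataSourceV2 {
    @Override
    public Table anonymousTable(Map<String, String> options) {
        return new PathTable(options);
    }
}

final class AnonymousTableDemo {
    // v1-style lookup: instantiate the source by class name, then ask it for a Table.
    static Table load(String className, Map<String, String> options) {
        try {
            DataSourceV2 source = (DataSourceV2) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
            return source.anonymousTable(options);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> opts = new HashMap<>();
        opts.put("path", "/tmp/example");
        Table table = load(PathSource.class.getName(), opts);
        if (table instanceof ReadSupport) {
            System.out.println(((ReadSupport) table).createReader());
        }
    }
}
```

In this sketch the caller never touches a catalog: the source is created reflectively, and all read and write capability is discovered on the returned {{Table}}.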

An alternative to the {{anonymousTable}} method is what I did in the WIP pull request for CTAS. In that PR, I created two ways to work with {{DataSourceV2}}: through the existing {{DataSourceV2Relation}} and through a new {{TableV2Relation}}. The first is for {{DataSourceV2}} instances that implement the read and write traits, while the latter is for {{Table}} objects that implement them. Either way works, though it would be cleaner to just use {{Table}}.

 

Thanks for the builder update! Immutability is the most important part, but I'd still prefer a builder interface with default methods instead of the mix-in traits.
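
One way to picture a builder interface with default methods is sketched below. The names {{ScanConfig}}, {{ScanConfigBuilder}}, and {{FilterOnlyBuilder}}, and the use of plain strings for filters and columns, are hypothetical and chosen only for illustration; this is not the API from the design doc or from Spark.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical names throughout; not the actual Spark API.

// Immutable result of the builder: once built, it cannot change.
final class ScanConfig {
    private final List<String> filters;
    private final List<String> columns;

    ScanConfig(List<String> filters, List<String> columns) {
        this.filters = Collections.unmodifiableList(new ArrayList<>(filters));
        this.columns = Collections.unmodifiableList(new ArrayList<>(columns));
    }

    List<String> filters() { return filters; }
    List<String> columns() { return columns; }
}

// A single builder interface. Optional capabilities are default methods
// rather than separate mix-in interfaces.
interface ScanConfigBuilder {
    // Returns the filters the source could NOT handle.
    // Default: nothing is pushed down, so every filter is returned.
    default List<String> pushFilters(List<String> filters) {
        return filters;
    }

    // Default: column pruning is not supported, so the request is ignored.
    default void pruneColumns(List<String> columns) { }

    ScanConfig build();
}

// A source that supports filter push-down only; it overrides just that method.
final class FilterOnlyBuilder implements ScanConfigBuilder {
    private final List<String> pushed = new ArrayList<>();

    @Override
    public List<String> pushFilters(List<String> filters) {
        pushed.addAll(filters);
        return Collections.emptyList();   // all filters were accepted
    }

    @Override
    public ScanConfig build() {
        return new ScanConfig(pushed, Collections.emptyList());
    }
}
```

With this shape a source overrides only the capabilities it supports, and callers can invoke every method without {{instanceof}} checks; the mutable state lives in the builder while the built {{ScanConfig}} stays immutable.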



> data source v2 API improvement
> ------------------------------
>
>                 Key: SPARK-24882
>                 URL: https://issues.apache.org/jira/browse/SPARK-24882
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>
> Data source V2 has been out for a while; see the SPIP [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. We have already migrated most of the built-in streaming data sources to the V2 API, and the file source migration is in progress. During the migration, we found several problems that we want to address before we stabilize the V2 API.
> To solve these problems, we need to separate responsibilities in the data source v2 API, isolate the stateful part of the API, and choose better names for some interfaces. For details, please see the attached Google doc: https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing


