Posted to dev@spark.apache.org by Hyukjin Kwon <gu...@gmail.com> on 2018/08/20 07:19:34 UTC

[DISCUSS] USING syntax for Datasource V2

Hi all,

I have been trying to follow up on `USING` syntax support, since it does not
look supported currently whereas the `format` API does support it. I have
been trying to understand why, and talked with Ryan.

Ryan knows all the details, and he and I thought it would be good to post
here - I have just started to look into this.
Here is Ryan's response:


> USING is currently used to select the underlying data source
implementation directly. The string passed in USING or format in the DF API
is used to resolve an implementation class.
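
For concreteness, the same source name flows through both paths today - a
minimal spark-shell sketch, using the built-in parquet source:

    // In spark-shell, where `spark` is the active SparkSession:

    // SQL path: the USING clause names the data source implementation.
    spark.sql("CREATE TABLE t (id BIGINT) USING parquet")

    // DataFrame API path: format() passes the same string, which is resolved
    // to the same implementation class.
    spark.range(10).write.format("parquet").save("/tmp/t_data")
    val df = spark.read.format("parquet").load("/tmp/t_data")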

> The existing catalog supports tables that specify their datasource
implementation, but this will not be the case for all catalogs when Spark
adds multiple catalog support. For example, a Cassandra catalog or a JDBC
catalog that exposes tables in those systems will definitely not support
users marking tables with the “parquet” data source. The catalog must have
the ability to determine the data source implementation. That’s why I think
it is valuable to think of the current ExternalCatalog as one that can
track tables with any read/write implementation. Other catalogs can’t and
won’t do that.

> In the catalog v2 API <https://github.com/apache/spark/pull/21306> I’ve
proposed, everything from CREATE TABLE is passed to the catalog. Then the
catalog determines what source to use and returns a Table instance that
uses some class for its ReadSupport and WriteSupport implementation. An
ExternalCatalog exposed through that API would receive the USING or
format string
as a table property and would return a Table that uses the correct
ReadSupport, so tables stored in an ExternalCatalog will work as they do
today.
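
As a rough illustration of that flow - the names below are simplified and
illustrative, not the exact interfaces in the PR - the point is that the
catalog, not the USING string, hands back the read/write implementation:

    // Rough sketch only; simplified, illustrative names, not the exact
    // interfaces in PR #21306.
    trait ReadSupport                            // stand-in for the DSv2 class
    trait WriteSupport                           // stand-in for the DSv2 class

    case class Ident(database: String, name: String)
    case class Column(name: String, dataType: String)

    trait Table {
      def schema: Seq[Column]
      def readSupport: ReadSupport               // chosen by the catalog
      def writeSupport: WriteSupport             // chosen by the catalog
    }

    trait TableCatalog {
      // Everything from CREATE TABLE is passed in; USING arrives as a table
      // property (e.g. "provider" -> "parquet") that the catalog may
      // interpret or ignore.
      def createTable(
          ident: Ident,
          schema: Seq[Column],
          properties: Map[String, String]): Table

      def loadTable(ident: Ident): Table
    }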

> I think other catalogs should be able to choose what to do with the USING string.
An Iceberg <https://github.com/Netflix/iceberg> catalog might use this to
determine the underlying file format, which could be parquet, orc, or avro.
Or, a JDBC catalog might use it for the underlying table implementation in
the DB. This would make the property more of a storage hint for the
catalog, which is going to determine the read/write implementation anyway.
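
For example, such a catalog might only read the string as a file-format
hint - hypothetical code, just to illustrate the "hint" interpretation:

    // Hypothetical: a catalog that reads USING as a file-format hint rather
    // than an implementation class. Names are illustrative only.
    object StorageHint {
      def chooseFileFormat(properties: Map[String, String]): String =
        properties.getOrElse("provider", "parquet") match {
          case fmt @ ("parquet" | "orc" | "avro") => fmt
          case other =>
            throw new IllegalArgumentException(s"Unsupported file format: $other")
        }
    }

    // StorageHint.chooseFileFormat(Map("provider" -> "orc"))  // returns "orc"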

> For cases where there is no catalog involved, the current plan is to use
the reflection-based approach from v1 with the USING or format string. In
v2, that should resolve a ReadSupportProvider, which is used to create a
ReadSupport directly from options. I think this is a good approach for
backward-compatibility, but it can’t provide the same features as a
catalog-based table. Catalogs are how we have decided to build reliable
behavior for CTAS and the other standard logical plans
<https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>.
CTAS is a create and then an insert, and a write implementation alone can’t
provide that create operation.
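
To make the CTAS point concrete, reusing the illustrative TableCatalog
sketch above (still hypothetical names): the reflection path has no
equivalent of the first call.

    // CTAS = create the table, then insert the query result. The
    // reflection-based path can hand back a read/write implementation from a
    // string, but it has nothing to call for the "create" half; a catalog does.
    object Ctas {
      def createTableAsSelect(
          catalog: TableCatalog,
          ident: Ident,
          schema: Seq[Column],
          properties: Map[String, String]): Table = {
        val table = catalog.createTable(ident, schema, properties) // create ...
        // ... then write the query result through table.writeSupport
        // (the insert step, elided in this sketch)
        table
      }
    }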

I was targeting the last case (where there is no catalog involved) in
particular. I think that approach is also good because `USING` syntax
compatibility should be kept anyway - this should reduce migration cost as
well. I was wondering what you all think about this.
If you think the last case should be supported anyway, I was thinking we
could proceed with it orthogonally. If you think other issues should be
resolved first, I think we (at least I) should take a look at the set of
catalog APIs.

Re: [DISCUSS] USING syntax for Datasource V2

Posted by Russell Spitzer <ru...@gmail.com>.
So the external catalogue is really more of a connector to a different
source of truth for table listings? That makes more sense.


Re: [DISCUSS] USING syntax for Datasource V2

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
> I don’t understand why a Cassandra Catalogue wouldn’t be able to store
> metadata references for a parquet table just as a Hive Catalogue can store
> references to a C* data source.

Sorry for the confusion. I’m not talking about a catalog that stores its
information in Cassandra. I’m talking about a catalog that talks directly
to Cassandra to expose Cassandra tables. It wouldn’t make sense because
Cassandra doesn’t store data in Parquet files. The reason you’d want to
have a catalog like this is to maintain a single source of truth for that
metadata, instead of keeping the canonical metadata for a table in the
system that manages it and additional metadata in Spark’s metastore catalog.
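
A hypothetical sketch of the distinction (illustrative names only): a
catalog that exposes Cassandra tables asks Cassandra itself for the table
definition, so there is no second copy of the metadata.

    // Hypothetical sketch; all names are illustrative.
    trait CassandraAdmin {                    // stand-in for a Cassandra client
      def describeTable(keyspace: String, table: String): Map[String, String]
    }

    class CassandraTableCatalog(admin: CassandraAdmin) {
      // Single source of truth: the table definition comes from Cassandra, and
      // the catalog, not a USING string, decides the read/write implementation.
      def loadTable(keyspace: String, table: String): Map[String, String] =
        admin.describeTable(keyspace, table)
    }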

You could certainly build a catalog implementation that stores its data in
Cassandra or JDBC and supports the same tables that Spark does today.
That’s just not what I’m talking about here.


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] USING syntax for Datasource V2

Posted by Russell Spitzer <ru...@gmail.com>.
I'm not sure I follow what the discussion topic is here

> For example, a Cassandra catalog or a JDBC catalog that exposes tables in
those systems will definitely not support users marking tables with the
“parquet” data source.

I don't understand why a Cassandra Catalogue wouldn't be able to store
metadata references for a parquet table just as a Hive Catalogue can store
references to a C* data source. We currently store Hive Metastore data in a
C* table, and this allows us to store tables with any underlying format even
though the catalogue itself is backed by C*.

Is the idea here that a table can't have multiple underlying formats in a
given catalogue? And the USING can then be used on read to force a
particular format?

> I think other catalogs should be able to choose what to do with the USING
> string

This makes sense to me, but I'm not sure why any catalogue would want to
ignore this?

It would be helpful to me to have a few examples written out, if that is
possible, comparing the old implementation and the new implementation.

Thanks for your time,
Russ


Re: [DISCUSS] USING syntax for Datasource V2

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks for posting this discussion to the dev list; it would be great to
hear what everyone thinks about the idea that USING should be a
catalog-specific storage configuration.

Related to this, I’ve updated the catalog PR, #21306
<https://github.com/apache/spark/pull/21306>, to include an implementation
that translates from the v2 TableCatalog API
<https://github.com/apache/spark/pull/21306/files#diff-a9d913d11630b965ef5dd3d3a02ca452>
to the current catalog API. That shows how this would fit together with v1,
at least for the catalog part. This will enable all of the new query plans
to be written to the TableCatalog API, even if they end up using the
default v1 catalog.
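
Conceptually, that translation layer is a thin adapter. A hedged sketch with
simplified, made-up names (not the code in the PR): wrap today's v1 catalog
behind a v2-shaped createTable call so new query plans can target one API.

    // Hedged sketch of the translation idea only; simplified names.
    trait V1SessionCatalog {                 // stand-in for the current catalog
      def createTable(
          name: String,
          schema: Seq[(String, String)],
          properties: Map[String, String]): Unit
    }

    class V2OverV1Catalog(v1: V1SessionCatalog) {
      // The v2-shaped call is forwarded to the v1 call; the USING/provider
      // string rides along as a table property, as it does today.
      def createTable(
          name: String,
          schema: Seq[(String, String)],
          properties: Map[String, String]): Unit =
        v1.createTable(name, schema, properties)
    }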


-- 
Ryan Blue
Software Engineer
Netflix