Posted to user@spark.apache.org by Tianchen Zhang <du...@gmail.com> on 2021/05/03 18:38:52 UTC

[Spark Catalog API] Support for metadata Backup/Restore

Hi all,

Currently the user-facing Catalog API doesn't support backing up or restoring
metadata. Our customers are asking for such functionality. Here is a
usage example:
1. Read all metadata of one Spark cluster
2. Save them into a Parquet file on DFS
3. Read the Parquet file and restore all metadata in another Spark cluster

In the current implementation, the Catalog API has the list methods
(listDatabases, listFunctions, etc.), but they don't return enough
information to restore an entity (for example, listDatabases loses the
"properties" of a database, and we need "DESCRIBE DATABASE EXTENDED" to
get them). It also only supports createTable (no other entity
creations). So today the only way we can back up and restore an entity is through Spark SQL.
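
To make that concrete, here is a minimal sketch (the path, output layout, and
version-dependent details are assumptions) of how the backup half can be
approximated today by falling back to SQL, since the Catalog API alone does not
surface everything:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    val backupPath = "hdfs:///tmp/catalog-backup"  // placeholder DFS location

    // spark.catalog.listDatabases() drops the database "properties",
    // so fall back to DESCRIBE DATABASE EXTENDED for each database.
    spark.catalog.listDatabases().collect().map(_.name).foreach { db =>
      spark.sql(s"DESCRIBE DATABASE EXTENDED $db")
        .write.mode("overwrite").parquet(s"$backupPath/databases/$db")

      // Capture each table's DDL so it can be replayed on another cluster.
      // (SHOW CREATE TABLE may not cover every table type or temporary view.)
      spark.catalog.listTables(db).collect().foreach { t =>
        spark.sql(s"SHOW CREATE TABLE $db.${t.name}")
          .write.mode("overwrite").parquet(s"$backupPath/tables/$db.${t.name}")
      }
    }

The restore side would read these Parquet files on the other cluster and replay
the captured statements with spark.sql(...), which is exactly the SQL-only
round trip described above.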

We want to introduce backup and restore at the API level. We are
thinking of doing this simply by adding backup() and restore() in
CatalogImpl, since ExternalCatalog already includes all the methods we need to
retrieve and recreate entities. We are wondering if there is any concern or
drawback with this approach. Please advise.
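
To make the proposal concrete, here is a rough sketch of the kind of logic
backup() and restore() could wrap, using ExternalCatalog (an internal,
version-dependent API) and handling databases only; tables and functions would
follow the same get*/create* pattern. The record layout and helper names are
assumptions, not an actual design:

    import java.net.URI
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.catalog.CatalogDatabase

    // Flat, Parquet-friendly view of a database definition, including "properties".
    case class DatabaseRecord(name: String, description: String,
                              locationUri: String, properties: Map[String, String])

    def backupDatabases(spark: SparkSession, path: String): Unit = {
      import spark.implicits._
      val external = spark.sharedState.externalCatalog
      val records = external.listDatabases().map { name =>
        val db = external.getDatabase(name)
        DatabaseRecord(db.name, db.description, db.locationUri.toString, db.properties)
      }
      records.toDF().write.mode("overwrite").parquet(path)
    }

    def restoreDatabases(spark: SparkSession, path: String): Unit = {
      import spark.implicits._
      val external = spark.sharedState.externalCatalog
      spark.read.parquet(path).as[DatabaseRecord].collect().foreach { r =>
        val db = CatalogDatabase(r.name, r.description, new URI(r.locationUri), r.properties)
        external.createDatabase(db, ignoreIfExists = true)
      }
    }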

Thank you in advance,
Tianchen

Re: [Spark Catalog API] Support for metadata Backup/Restore

Posted by Tianchen Zhang <du...@gmail.com>.
Thanks everyone for the input. Yes, it makes sense that metadata
backup/restore should be done outside Spark. We will point our customers
to documentation on how that can be done and leave the
implementation to them.

Thanks,
Tianchen

On Tue, May 11, 2021 at 1:14 AM Mich Talebzadeh <mi...@gmail.com>
wrote:

> From my experience of dealing with metadata for other applications like
> Hive if needed an external database for Spark metadata would be useful.
>
> However, the maintenance and upgrade of that database should be external
> to Spark (left to the user) and as usual  some form of reliable API or JDBC
> connection will be needed from Spark to this persistent storage. Maybe a
> NoSQL DB will do.
>
> HTH,
>
> Mich
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 11 May 2021 at 08:58, Wenchen Fan <cl...@gmail.com> wrote:
>
>> That's my expectation as well. Spark needs a reliable catalog.
>> backup/restore is just implementation details about how you make your
>> catalog reliable, which should be transparent to Spark.
>>
>> On Sat, May 8, 2021 at 6:54 AM ayan guha <gu...@gmail.com> wrote:
>>
>>> Just a consideration:
>>>
>>> Is there a value in backup/restore metadata within spark? I would
>>> strongly argue if the metadata is valuable enough and persistent enough,
>>> why dont just use external metastore? It is fairly straightforward process.
>>> Also regardless you are in cloud or not, database bkp is a routine and
>>> established pattern in most organizations.
>>> You can also enhance HA and DR by having replicas across zones and
>>> regions etc etc
>>>
>>> Thoughts?
>>>
>>>
>>>
>>>
>>> On Sat, 8 May 2021 at 7:02 am, Tianchen Zhang <du...@gmail.com>
>>> wrote:
>>>
>>>> For now we are thinking about adding two methods in Catalog API, not
>>>> SQL commands:
>>>> 1. spark.catalog.backup, which backs up the current catalog.
>>>> 2. spark.catalog.restore(file), which reads the DFS file and recreates
>>>> the entities described in that file.
>>>>
>>>> Can you please give an example of exposing client APIs to the end users
>>>> in this approach? The users can only call backup or restore, right?
>>>>
>>>> Thanks,
>>>> Tianchen
>>>>
>>>> On Fri, May 7, 2021 at 12:27 PM Wenchen Fan <cl...@gmail.com>
>>>> wrote:
>>>>
>>>>> If a catalog implements backup/restore, it can easily expose some
>>>>> client APIs to the end-users (e.g. REST API), I don't see a strong reason
>>>>> to expose the APIs to Spark. Do you plan to add new SQL commands in Spark
>>>>> to backup/restore a catalog?
>>>>>
>>>>> On Tue, May 4, 2021 at 2:39 AM Tianchen Zhang <
>>>>> dustinzhang2012@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Currently the user-facing Catalog API doesn't support backup/restore
>>>>>> metadata. Our customers are asking for such functionalities. Here is a
>>>>>> usage example:
>>>>>> 1. Read all metadata of one Spark cluster
>>>>>> 2. Save them into a Parquet file on DFS
>>>>>> 3. Read the Parquet file and restore all metadata in another Spark
>>>>>> cluster
>>>>>>
>>>>>> From the current implementation, Catalog API has the list methods
>>>>>> (listDatabases, listFunctions, etc.) but they don't return enough
>>>>>> information in order to restore an entity (for example, listDatabases lose
>>>>>> "properties" of the database and we need "describe database extended" to
>>>>>> get them). And it only supports createTable (not any other entity
>>>>>> creations). The only way we can backup/restore an entity is using Spark SQL.
>>>>>>
>>>>>> We want to introduce the backup and restore from an API level. We are
>>>>>> thinking of doing this simply by adding backup() and restore() in
>>>>>> CatalogImpl, as ExternalCatalog already includes all the methods we need to
>>>>>> retrieve and recreate entities. We are wondering if there is any concern or
>>>>>> drawback of this approach. Please advise.
>>>>>>
>>>>>> Thank you in advance,
>>>>>> Tianchen
>>>>>>
>>>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>

Re: [Spark Catalog API] Support for metadata Backup/Restore

Posted by Mich Talebzadeh <mi...@gmail.com>.
From my experience dealing with metadata for other applications like
Hive, an external database for Spark metadata would be useful if needed.

However, the maintenance and upgrade of that database should be external to
Spark (left to the user), and as usual some form of reliable API or JDBC
connection will be needed from Spark to this persistent storage. Maybe a
NoSQL DB would do.

HTH,

Mich


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 11 May 2021 at 08:58, Wenchen Fan <cl...@gmail.com> wrote:

> That's my expectation as well. Spark needs a reliable catalog.
> backup/restore is just implementation details about how you make your
> catalog reliable, which should be transparent to Spark.
>
> On Sat, May 8, 2021 at 6:54 AM ayan guha <gu...@gmail.com> wrote:
>
>> Just a consideration:
>>
>> Is there a value in backup/restore metadata within spark? I would
>> strongly argue if the metadata is valuable enough and persistent enough,
>> why dont just use external metastore? It is fairly straightforward process.
>> Also regardless you are in cloud or not, database bkp is a routine and
>> established pattern in most organizations.
>> You can also enhance HA and DR by having replicas across zones and
>> regions etc etc
>>
>> Thoughts?
>>
>>
>>
>>
>> On Sat, 8 May 2021 at 7:02 am, Tianchen Zhang <du...@gmail.com>
>> wrote:
>>
>>> For now we are thinking about adding two methods in Catalog API, not SQL
>>> commands:
>>> 1. spark.catalog.backup, which backs up the current catalog.
>>> 2. spark.catalog.restore(file), which reads the DFS file and recreates
>>> the entities described in that file.
>>>
>>> Can you please give an example of exposing client APIs to the end users
>>> in this approach? The users can only call backup or restore, right?
>>>
>>> Thanks,
>>> Tianchen
>>>
>>> On Fri, May 7, 2021 at 12:27 PM Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>>> If a catalog implements backup/restore, it can easily expose some
>>>> client APIs to the end-users (e.g. REST API), I don't see a strong reason
>>>> to expose the APIs to Spark. Do you plan to add new SQL commands in Spark
>>>> to backup/restore a catalog?
>>>>
>>>> On Tue, May 4, 2021 at 2:39 AM Tianchen Zhang <
>>>> dustinzhang2012@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Currently the user-facing Catalog API doesn't support backup/restore
>>>>> metadata. Our customers are asking for such functionalities. Here is a
>>>>> usage example:
>>>>> 1. Read all metadata of one Spark cluster
>>>>> 2. Save them into a Parquet file on DFS
>>>>> 3. Read the Parquet file and restore all metadata in another Spark
>>>>> cluster
>>>>>
>>>>> From the current implementation, Catalog API has the list methods
>>>>> (listDatabases, listFunctions, etc.) but they don't return enough
>>>>> information in order to restore an entity (for example, listDatabases lose
>>>>> "properties" of the database and we need "describe database extended" to
>>>>> get them). And it only supports createTable (not any other entity
>>>>> creations). The only way we can backup/restore an entity is using Spark SQL.
>>>>>
>>>>> We want to introduce the backup and restore from an API level. We are
>>>>> thinking of doing this simply by adding backup() and restore() in
>>>>> CatalogImpl, as ExternalCatalog already includes all the methods we need to
>>>>> retrieve and recreate entities. We are wondering if there is any concern or
>>>>> drawback of this approach. Please advise.
>>>>>
>>>>> Thank you in advance,
>>>>> Tianchen
>>>>>
>>>> --
>> Best Regards,
>> Ayan Guha
>>
>

Re: [Spark Catalog API] Support for metadata Backup/Restore

Posted by Wenchen Fan <cl...@gmail.com>.
That's my expectation as well. Spark needs a reliable catalog;
backup/restore is just an implementation detail of how you make your
catalog reliable, and it should be transparent to Spark.

On Sat, May 8, 2021 at 6:54 AM ayan guha <gu...@gmail.com> wrote:

> Just a consideration:
>
> Is there a value in backup/restore metadata within spark? I would strongly
> argue if the metadata is valuable enough and persistent enough, why dont
> just use external metastore? It is fairly straightforward process. Also
> regardless you are in cloud or not, database bkp is a routine and
> established pattern in most organizations.
> You can also enhance HA and DR by having replicas across zones and regions
> etc etc
>
> Thoughts?
>
>
>
>
> On Sat, 8 May 2021 at 7:02 am, Tianchen Zhang <du...@gmail.com>
> wrote:
>
>> For now we are thinking about adding two methods in Catalog API, not SQL
>> commands:
>> 1. spark.catalog.backup, which backs up the current catalog.
>> 2. spark.catalog.restore(file), which reads the DFS file and recreates
>> the entities described in that file.
>>
>> Can you please give an example of exposing client APIs to the end users
>> in this approach? The users can only call backup or restore, right?
>>
>> Thanks,
>> Tianchen
>>
>> On Fri, May 7, 2021 at 12:27 PM Wenchen Fan <cl...@gmail.com> wrote:
>>
>>> If a catalog implements backup/restore, it can easily expose some client
>>> APIs to the end-users (e.g. REST API), I don't see a strong reason to
>>> expose the APIs to Spark. Do you plan to add new SQL commands in Spark to
>>> backup/restore a catalog?
>>>
>>> On Tue, May 4, 2021 at 2:39 AM Tianchen Zhang <du...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Currently the user-facing Catalog API doesn't support backup/restore
>>>> metadata. Our customers are asking for such functionalities. Here is a
>>>> usage example:
>>>> 1. Read all metadata of one Spark cluster
>>>> 2. Save them into a Parquet file on DFS
>>>> 3. Read the Parquet file and restore all metadata in another Spark
>>>> cluster
>>>>
>>>> From the current implementation, Catalog API has the list methods
>>>> (listDatabases, listFunctions, etc.) but they don't return enough
>>>> information in order to restore an entity (for example, listDatabases lose
>>>> "properties" of the database and we need "describe database extended" to
>>>> get them). And it only supports createTable (not any other entity
>>>> creations). The only way we can backup/restore an entity is using Spark SQL.
>>>>
>>>> We want to introduce the backup and restore from an API level. We are
>>>> thinking of doing this simply by adding backup() and restore() in
>>>> CatalogImpl, as ExternalCatalog already includes all the methods we need to
>>>> retrieve and recreate entities. We are wondering if there is any concern or
>>>> drawback of this approach. Please advise.
>>>>
>>>> Thank you in advance,
>>>> Tianchen
>>>>
>>> --
> Best Regards,
> Ayan Guha
>

Re: [Spark Catalog API] Support for metadata Backup/Restore

Posted by ayan guha <gu...@gmail.com>.
Just a consideration:

Is there value in backing up and restoring metadata within Spark? I would
strongly argue that if the metadata is valuable and persistent enough, why not
just use an external metastore? It is a fairly straightforward process. Also,
regardless of whether you are in the cloud or not, database backup is a routine
and established pattern in most organizations.
You can also improve HA and DR by keeping replicas across zones, regions, and
so on.
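
(For illustration only — a minimal sketch, with a placeholder metastore URI, of
pointing a Spark session at a shared, long-lived Hive metastore service so the
metadata outlives any single cluster:)

    import org.apache.spark.sql.SparkSession

    // "thrift://metastore.example.com:9083" is a placeholder for a metastore
    // service backed by an external database that is backed up and replicated
    // with standard database tooling.
    val spark = SparkSession.builder()
      .appName("shared-external-metastore")
      .config("spark.hadoop.hive.metastore.uris", "thrift://metastore.example.com:9083")
      .enableHiveSupport()
      .getOrCreate()

    // Databases, tables and functions created through this session are now
    // recorded in the external metastore rather than a per-cluster Derby catalog.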

Thoughts?




On Sat, 8 May 2021 at 7:02 am, Tianchen Zhang <du...@gmail.com>
wrote:

> For now we are thinking about adding two methods in Catalog API, not SQL
> commands:
> 1. spark.catalog.backup, which backs up the current catalog.
> 2. spark.catalog.restore(file), which reads the DFS file and recreates the
> entities described in that file.
>
> Can you please give an example of exposing client APIs to the end users in
> this approach? The users can only call backup or restore, right?
>
> Thanks,
> Tianchen
>
> On Fri, May 7, 2021 at 12:27 PM Wenchen Fan <cl...@gmail.com> wrote:
>
>> If a catalog implements backup/restore, it can easily expose some client
>> APIs to the end-users (e.g. REST API), I don't see a strong reason to
>> expose the APIs to Spark. Do you plan to add new SQL commands in Spark to
>> backup/restore a catalog?
>>
>> On Tue, May 4, 2021 at 2:39 AM Tianchen Zhang <du...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> Currently the user-facing Catalog API doesn't support backup/restore
>>> metadata. Our customers are asking for such functionalities. Here is a
>>> usage example:
>>> 1. Read all metadata of one Spark cluster
>>> 2. Save them into a Parquet file on DFS
>>> 3. Read the Parquet file and restore all metadata in another Spark
>>> cluster
>>>
>>> From the current implementation, Catalog API has the list methods
>>> (listDatabases, listFunctions, etc.) but they don't return enough
>>> information in order to restore an entity (for example, listDatabases lose
>>> "properties" of the database and we need "describe database extended" to
>>> get them). And it only supports createTable (not any other entity
>>> creations). The only way we can backup/restore an entity is using Spark SQL.
>>>
>>> We want to introduce the backup and restore from an API level. We are
>>> thinking of doing this simply by adding backup() and restore() in
>>> CatalogImpl, as ExternalCatalog already includes all the methods we need to
>>> retrieve and recreate entities. We are wondering if there is any concern or
>>> drawback of this approach. Please advise.
>>>
>>> Thank you in advance,
>>> Tianchen
>>>
>> --
Best Regards,
Ayan Guha

Re: [Spark Catalog API] Support for metadata Backup/Restore

Posted by Tianchen Zhang <du...@gmail.com>.
For now we are thinking about adding two methods to the Catalog API, not SQL
commands:
1. spark.catalog.backup, which backs up the current catalog.
2. spark.catalog.restore(file), which reads the DFS file and recreates the
entities described in that file.

Can you please give an example of exposing client APIs to the end users in
this approach? The users can only call backup or restore, right?
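
(For illustration, assuming the methods existed — they do not today, and the
backup destination argument is a guess — usage would look roughly like:)

    // On the source cluster: write all catalog metadata to a DFS location.
    spark.catalog.backup("hdfs:///backups/catalog-2021-05-07")   // hypothetical API

    // On the target cluster: read that backup and recreate the entities.
    spark.catalog.restore("hdfs:///backups/catalog-2021-05-07")  // hypothetical API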

Thanks,
Tianchen

On Fri, May 7, 2021 at 12:27 PM Wenchen Fan <cl...@gmail.com> wrote:

> If a catalog implements backup/restore, it can easily expose some client
> APIs to the end-users (e.g. REST API), I don't see a strong reason to
> expose the APIs to Spark. Do you plan to add new SQL commands in Spark to
> backup/restore a catalog?
>
> On Tue, May 4, 2021 at 2:39 AM Tianchen Zhang <du...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> Currently the user-facing Catalog API doesn't support backup/restore
>> metadata. Our customers are asking for such functionalities. Here is a
>> usage example:
>> 1. Read all metadata of one Spark cluster
>> 2. Save them into a Parquet file on DFS
>> 3. Read the Parquet file and restore all metadata in another Spark cluster
>>
>> From the current implementation, Catalog API has the list methods
>> (listDatabases, listFunctions, etc.) but they don't return enough
>> information in order to restore an entity (for example, listDatabases lose
>> "properties" of the database and we need "describe database extended" to
>> get them). And it only supports createTable (not any other entity
>> creations). The only way we can backup/restore an entity is using Spark SQL.
>>
>> We want to introduce the backup and restore from an API level. We are
>> thinking of doing this simply by adding backup() and restore() in
>> CatalogImpl, as ExternalCatalog already includes all the methods we need to
>> retrieve and recreate entities. We are wondering if there is any concern or
>> drawback of this approach. Please advise.
>>
>> Thank you in advance,
>> Tianchen
>>
>

Re: [Spark Catalog API] Support for metadata Backup/Restore

Posted by Wenchen Fan <cl...@gmail.com>.
If a catalog implements backup/restore, it can easily expose client
APIs to the end users (e.g. a REST API); I don't see a strong reason to
expose these APIs through Spark. Do you plan to add new SQL commands in Spark
to back up/restore a catalog?

On Tue, May 4, 2021 at 2:39 AM Tianchen Zhang <du...@gmail.com>
wrote:

> Hi all,
>
> Currently the user-facing Catalog API doesn't support backup/restore
> metadata. Our customers are asking for such functionalities. Here is a
> usage example:
> 1. Read all metadata of one Spark cluster
> 2. Save them into a Parquet file on DFS
> 3. Read the Parquet file and restore all metadata in another Spark cluster
>
> From the current implementation, Catalog API has the list methods
> (listDatabases, listFunctions, etc.) but they don't return enough
> information in order to restore an entity (for example, listDatabases lose
> "properties" of the database and we need "describe database extended" to
> get them). And it only supports createTable (not any other entity
> creations). The only way we can backup/restore an entity is using Spark SQL.
>
> We want to introduce the backup and restore from an API level. We are
> thinking of doing this simply by adding backup() and restore() in
> CatalogImpl, as ExternalCatalog already includes all the methods we need to
> retrieve and recreate entities. We are wondering if there is any concern or
> drawback of this approach. Please advise.
>
> Thank you in advance,
> Tianchen
>
