Posted to dev@iceberg.apache.org by Anton Okolnychyi <ao...@apple.com.INVALID> on 2020/08/04 22:12:30 UTC

Re: [DISCUSS] SQL syntax extensions

During the last sync we discussed a blocker for this work raised by Carl. It was unclear how role-based access control would work in the proposed approach. Specifically, how do we ensure that user `X` not only has access to a stored procedure but is also allowed to execute it on table `T`, where the table name `T` is provided as an argument?

To address this, the Presto community can expose an API for performing security checks within stored procedures. It is not ideal, as it is up to each stored procedure to perform all checks correctly, but it solves the problem.
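
For illustration, here is a hedged sketch of the two checks involved. The procedure and argument names below are placeholders, not an existing Presto or Iceberg API:

    -- the engine can verify that the caller may execute the procedure itself
    CALL iceberg.system.rollback_to_snapshot(table => 'db.events', snapshot_id => 123);
    -- but the table is only an argument value, so the procedure implementation must
    -- additionally verify that the caller is allowed to modify 'db.events' before
    -- doing any work; the proposed Presto API would host that second check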

- Anton 

> On 29 Jul 2020, at 13:46, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> That looks like a good plan to me. Starting with stored procedures and adding custom syntax where possible sounds like a good way to go.
> 
> For Spark, I agree that we can start exploring a plugin that can extend Spark's syntax. Having that done will make development faster and make it easier to get this upstream, I think.
> 
> On Mon, Jul 27, 2020 at 11:14 PM Anton Okolnychyi <ao...@apple.com.invalid> wrote:
> Thanks everybody for taking a look at the doc. FYI, I’ve updated it.
> 
> I would like to share some intermediate thoughts.
> 
> 1. It seems beneficial to follow the stored procedure approach for small actions like rolling back or expiring snapshots. Presto already allows connectors to define stored procedures, and it will be much easier to add such syntax to other query engines as it is standard SQL. If we go that route, optional and named arguments can make the syntax very reasonable for straightforward operations (there is a rough sketch after point 7).
> 
> 2. There are still some cases where separate commands *may* make sense. For example, it may be more natural to have SNAPSHOT or MIGRATE as separate commands (also sketched after point 7). That way, we can use well-known clauses like TBLPROPERTIES. Later, we may build a VACUUM command with different modes to combine 3-4 actions. We have SNAPSHOT and MIGRATE internally and they are frequently used (especially SNAPSHOT).
> 
> 3. If we decide to build SNAPSHOT and MIGRATE as separate commands, it is unlikely we can get them into query engines even though the commands are generic. So, we may need to maintain them in Iceberg in the form of SQL extensions (e.g. an extended parser via SQL extensions in Spark). That may not always be possible in every query engine.
> 
> 4. We need to align the syntax, including argument names, across query engines. Otherwise, it will be a mess if each query engine has its own cosmetic differences.
> 
> 5. Spark does not have a plugin for stored procedures. There is a proposal from Ryan to add a function catalog API. I think it is a bit different from a stored procedure catalog, as functions are used in SELECT while procedures are used in CALL. While we can explore how to add such support to Spark, we most likely need to start with SQL extensions in Iceberg. Otherwise, we will be blocked for a long time.
> 
> 6. Wherever possible, SQL calls must return output summarizing what was done (see the sketch after point 7). For example, if we expire snapshots, return the number of expired snapshots, the number of removed data and metadata files, the number of scanned manifests, etc. If we import a table, output the number of imported files, etc.
> 
> 7. SQL calls must be smart. For example, we should not simply rewrite all metadata or data. Commands should analyze what needs to be rewritten. I’ve tried to outline that for metadata and will submit a doc for data compaction.
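> 
> For illustration, here is a rough sketch of the syntax behind points 1, 2, and 6. The procedure name, argument names, and clauses below are placeholders meant to show the shape of the calls, not a settled design:
> 
>     -- point 1: a stored procedure call with named and optional arguments
>     CALL iceberg.system.expire_snapshots(
>       table => 'db.events',
>       older_than => TIMESTAMP '2020-07-01 00:00:00'  -- optional, named argument
>     );
>     -- point 6: the call returns a one-row summary, e.g.
>     --   (expired_snapshots, removed_data_files, removed_metadata_files, scanned_manifests)
> 
>     -- point 2: SNAPSHOT as a separate command that reuses well-known clauses
>     SNAPSHOT parquet_db.events AS iceberg_db.events
>     TBLPROPERTIES ('write.format.default' = 'parquet');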
> 
> - Anton
> 
> 
>> On 23 Jul 2020, at 12:40, Anton Okolnychyi <aokolnychyi@apple.com.INVALID> wrote:
>> 
>> Hi devs,
>> 
>> I want to start a discussion on whether we want some SQL extensions in Iceberg that would help data engineers invoke Iceberg-specific functionality through SQL. I know companies have this internally, but I would like to unify this starting from Spark 3 and share the same syntax across query engines for consistent behavior.
>> 
>> I’ve put together a short doc: 
>> 
>> https://docs.google.com/document/d/1Nf8c16R2hj4lSc-4sQg4oiUUV_F4XqZKth1woEo6TN8
>> 
>> I’d appreciate everyone’s feedback. Please feel free to comment and add alternatives.
>> 
>> Thanks,
>> Anton 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: [DISCUSS] SQL syntax extensions

Posted by Russell Spitzer <ru...@gmail.com>.
I think the moment we start touching Catalyst we should be using Scala. If
in the future there is a stored procedure API in Spark, we can always go
back to Java.

On Tue, Aug 25, 2020, 4:59 PM Anton Okolnychyi
<ao...@apple.com.invalid> wrote:

> One more point we should clarify before implementing: where will the SQL
> extensions live? In the case of Presto, the extensions will be exposed as
> proper stored procedures and can be part of the Presto repo. In the case
> of Spark, we could either keep them in a new module in Iceberg or in a
> completely different repo. Personally, I think the first option is better,
> as those extensions will be essential for everyone who uses Iceberg. They
> will also be optional; it is up to users to enable them.
>
> W.r.t. the Spark extensions, we also need to clarify which language to
> use. In general, it will be easier to use Scala, but we recently got rid
> of it to avoid reasoning about different Scala versions. Writing
> extensions in Java may be a bit painful. That’s why I am inclined to have
> a Scala module for Spark 3 SQL extensions in Iceberg that users can
> optionally enable via SessionExtensions. What does everybody else think?
>
> Thanks,
> Anton

Re: [DISCUSS] SQL syntax extensions

Posted by Anton Okolnychyi <ao...@apple.com.INVALID>.
One more point we should clarify before implementing: where will the SQL extensions live? In the case of Presto, the extensions will be exposed as proper stored procedures and can be part of the Presto repo. In the case of Spark, we could either keep them in a new module in Iceberg or in a completely different repo. Personally, I think the first option is better, as those extensions will be essential for everyone who uses Iceberg. They will also be optional; it is up to users to enable them.

W.r.t. the Spark extensions, we also need to clarify which language to use. In general, it will be easier to use Scala, but we recently got rid of it to avoid reasoning about different Scala versions. Writing extensions in Java may be a bit painful. That’s why I am inclined to have a Scala module for Spark 3 SQL extensions in Iceberg that users can optionally enable via SessionExtensions. What does everybody else think?
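
For context, enabling such a module would look roughly like the line below; the extension class name is an assumption for illustration, as no such class exists in Iceberg yet:

    spark-shell --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions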

Thanks,
Anton
