Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/10 06:59:09 UTC

[GitHub] [arrow-datafusion] jorgecarleitao opened a new issue #533: Add extension plugin to parse SQL into logical plan

jorgecarleitao opened a new issue #533:
URL: https://github.com/apache/arrow-datafusion/issues/533


   As a user of DataFusion, I would like to be able to install custom SQL parsing rules in DataFusion, so that I can plan custom nodes from SQL.
   
   This would allow me to extend DataFusion's core capabilities beyond its supported SQL.
   
   Examples:
   * `OPTIMIZE`, `VACUUM`
   * `select * from t version as of n` (Delta Lake)
   
   I would like to support 2 main cases:
   
   * Parse entire SQL statements (e.g. `select * from t version as of n`) into a logical node
   * Parse single SQL expressions (e.g. `my_custom_expr` in `select my_custom_expr(t) from table1`) into a custom logical expression
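
The two cases above could be served by a single extension point with one hook per case. The following is only a minimal sketch using the standard library: `SqlExtension`, `Planner`, `LogicalPlan`, `CustomExpr` and `DeltaLike` are hypothetical names invented for illustration, not DataFusion APIs, and the whitespace tokenization stands in for a real SQL parser.

```rust
/// Hypothetical logical plan: a built-in scan or a user-defined extension node.
#[derive(Debug, PartialEq)]
enum LogicalPlan {
    TableScan { table: String },
    Extension { name: String, args: Vec<String> },
}

/// Hypothetical user-defined logical expression.
#[derive(Debug, PartialEq)]
struct CustomExpr {
    name: String,
    args: Vec<String>,
}

/// Hypothetical extension point covering both cases. Each hook returns `None`
/// when the input is not something the extension recognizes.
trait SqlExtension {
    fn plan_statement(&self, _sql: &str) -> Option<LogicalPlan> {
        None
    }
    fn plan_function(&self, _name: &str, _args: &[&str]) -> Option<CustomExpr> {
        None
    }
}

/// Example extension handling `select * from <t> version as of <n>`
/// and the function `my_custom_expr`.
struct DeltaLike;

impl SqlExtension for DeltaLike {
    fn plan_statement(&self, sql: &str) -> Option<LogicalPlan> {
        // Naive tokenization; only lowercase keywords are recognized.
        let words: Vec<&str> = sql.split_whitespace().collect();
        match words.as_slice() {
            ["select", "*", "from", table, "version", "as", "of", n] => {
                Some(LogicalPlan::Extension {
                    name: "version_as_of".to_string(),
                    args: vec![table.to_string(), n.to_string()],
                })
            }
            _ => None,
        }
    }

    fn plan_function(&self, name: &str, args: &[&str]) -> Option<CustomExpr> {
        if name != "my_custom_expr" {
            return None;
        }
        Some(CustomExpr {
            name: name.to_string(),
            args: args.iter().map(|a| a.to_string()).collect(),
        })
    }
}

/// Hypothetical planner: tries the installed extensions first, then falls
/// back to a (very small) default dialect.
struct Planner {
    extensions: Vec<Box<dyn SqlExtension>>,
}

impl Planner {
    fn plan_statement(&self, sql: &str) -> Option<LogicalPlan> {
        self.extensions
            .iter()
            .find_map(|e| e.plan_statement(sql))
            .or_else(|| {
                let words: Vec<&str> = sql.split_whitespace().collect();
                match words.as_slice() {
                    ["select", "*", "from", table] => Some(LogicalPlan::TableScan {
                        table: table.to_string(),
                    }),
                    _ => None,
                }
            })
    }

    fn plan_function(&self, name: &str, args: &[&str]) -> Option<CustomExpr> {
        self.extensions
            .iter()
            .find_map(|e| e.plan_function(name, args))
    }
}

fn main() {
    let planner = Planner {
        extensions: vec![Box::new(DeltaLike) as Box<dyn SqlExtension>],
    };
    println!("{:?}", planner.plan_statement("select * from t version as of 3"));
    println!("{:?}", planner.plan_statement("select * from t"));
    println!("{:?}", planner.plan_function("my_custom_expr", &["t"]));
}
```

Returning `None` from a hook lets several extensions coexist: the first one that recognizes the input wins, and unrecognized SQL falls through to the default dialect.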
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] adsharma commented on issue #533: Add extension plugin to parse SQL into logical plan

Posted by GitBox <gi...@apache.org>.
adsharma commented on issue #533:
URL: https://github.com/apache/arrow-datafusion/issues/533#issuecomment-859977886


   > select * from t version as of n
   
   Is preventing a query such as the following desirable?
   
   ```
   select * from t1 version as of n1
   union  all
   select * from t2 version as of n2
   ```
   
   or is this by design?





[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #533: Add extension plugin to parse SQL into logical plan

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #533:
URL: https://github.com/apache/arrow-datafusion/issues/533#issuecomment-865194671


   For the avoidance of doubt, the goal of this issue is not to support these extension languages in DataFusion itself, but to allow users to plug in their own custom extensions, so that they can support them themselves.
   





[GitHub] [arrow-datafusion] adsharma commented on issue #533: Add extension plugin to parse SQL into logical plan

Posted by GitBox <gi...@apache.org>.
adsharma commented on issue #533:
URL: https://github.com/apache/arrow-datafusion/issues/533#issuecomment-865192183


   Are the following good references for what delta lake is proposing?
   
   https://docs.databricks.com/delta/concurrency-control.html
   https://docs.databricks.com/delta/optimizations/isolation-level.html
   
   I get that delta lake uses transactions on the *metadata* to ensure that a
   consistent view of the table is presented to batch jobs that may be reading
   it.
   
   The reasoning I was using is that SQL is currently a unified query language
   that works for both OLTP and OLAP. In an ideal world, one set of
   extensions would address both use cases.
   
   So what I understood from the discussion is: if there is an "events" table
   with a "timestamp" column that is partitioned by the hour, it's perfectly
   legitimate for a query to aggregate over hourly partitions to compute some
   sort of view. No transaction isolation guarantee is violated.
   
   However, if a table is getting updated with new data and a query is able to
   see both the old version and the new version and compute stats using some
   mix of the two, delta lake isolation guarantees are violated (assuming the
   tables were set up with WriteSerializable isolation level)?
   
   On Sun, Jun 20, 2021 at 7:34 PM QP Hou ***@***.***> wrote:
   
   > as of n is a deltalake specific SQL extension. It's better to think of t
   > version as of n as a different table. Datafusion is not a transactional
   > query engine, so querying the same table should always return the same
   > result.
   





[GitHub] [arrow-datafusion] adsharma commented on issue #533: Add extension plugin to parse SQL into logical plan

Posted by GitBox <gi...@apache.org>.
adsharma commented on issue #533:
URL: https://github.com/apache/arrow-datafusion/issues/533#issuecomment-860999601


   I don't have much context about the proposal. Trying to understand things better. Please bear with me.
   
   The reason why SQL doesn't have `select * from t version as of n` is that it violates the isolation property in ACID, by allowing a query to read data from two different points in time (or logical sequence numbers if you prefer).
   
   Instead, SQL databases prefer a flow such as:
   
   ```
   BEGIN; -- implicitly selects a version; all queries that follow read from that version
   select * from t1
   union all
   select * from t2;
   COMMIT;
   ```
   
   I can also imagine a variant such as:
   
   ```
   BEGIN TRANSACTION @n1
   ...
   COMMIT
   ```
   
   which has the same effect. The benefit of these variants is that it makes it harder to write SQL that violates isolation properties.
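
The difference between pinning one version per transaction and choosing a version per table can be sketched as follows. This is an illustration under stated assumptions, not how any particular engine works: `Catalog`, `Transaction`, `commit`, `read_as_of`, `begin` and `scan` are hypothetical names, and each commit simply stores the full table contents.

```rust
use std::collections::BTreeMap;

/// Hypothetical versioned catalog: table name -> (commit version -> table contents).
#[derive(Default)]
struct Catalog {
    tables: BTreeMap<String, BTreeMap<u64, Vec<i64>>>,
}

impl Catalog {
    /// Records the full contents of `table` as of commit `version`.
    fn commit(&mut self, table: &str, version: u64, rows: Vec<i64>) {
        self.tables
            .entry(table.to_string())
            .or_default()
            .insert(version, rows);
    }

    /// Newest contents of `table` whose commit version is <= `version`.
    fn read_as_of(&self, table: &str, version: u64) -> Option<&Vec<i64>> {
        self.tables
            .get(table)?
            .range(..=version)
            .next_back()
            .map(|(_, rows)| rows)
    }

    /// Analogue of `BEGIN TRANSACTION @version`: pins one version for all later reads.
    fn begin(&self, version: u64) -> Transaction<'_> {
        Transaction { catalog: self, version }
    }
}

/// A read-only transaction that remembers the version chosen at `begin`.
struct Transaction<'a> {
    catalog: &'a Catalog,
    version: u64,
}

impl<'a> Transaction<'a> {
    /// Every scan reads at the pinned version, so the caller cannot mix versions.
    fn scan(&self, table: &str) -> Option<&'a Vec<i64>> {
        self.catalog.read_as_of(table, self.version)
    }
}

fn main() {
    let mut catalog = Catalog::default();
    catalog.commit("t1", 1, vec![1]);
    catalog.commit("t2", 1, vec![10]);
    catalog.commit("t1", 2, vec![1, 2]);
    catalog.commit("t2", 2, vec![10, 20]);

    // Pinned version: both scans observe the tables as of version 1.
    let tx = catalog.begin(1);
    println!("{:?} {:?}", tx.scan("t1"), tx.scan("t2"));

    // Per-table versions: a single query may combine t1 as of 2 with t2 as of 1.
    println!(
        "{:?} {:?}",
        catalog.read_as_of("t1", 2),
        catalog.read_as_of("t2", 1)
    );
}
```

With the transaction handle, the version is chosen once and there is no way to pass a different one per table; the per-table form leaves that choice to whoever writes the query.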





[GitHub] [arrow-datafusion] houqp edited a comment on issue #533: Add extension plugin to parse SQL into logical plan

Posted by GitBox <gi...@apache.org>.
houqp edited a comment on issue #533:
URL: https://github.com/apache/arrow-datafusion/issues/533#issuecomment-864679870


   `version as of n` is a Delta Lake-specific SQL extension. It's better to think of `t version as of n` as a different table. DataFusion is not a transactional query engine, so querying the same table should always return the same result.
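
One way to picture "a different table" is to make the version part of the table's identity, so that each (name, version) pair resolves to its own immutable snapshot. The sketch below assumes exactly that; `TableKey`, `Registry` and `resolve` are hypothetical names for illustration, not DataFusion or delta-rs APIs.

```rust
use std::collections::HashMap;

/// Hypothetical table identifier: the version is part of the table's identity.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct TableKey {
    name: String,
    version: u64,
}

/// Hypothetical registry mapping each (name, version) pair to an immutable snapshot.
struct Registry {
    snapshots: HashMap<TableKey, Vec<i64>>,
}

impl Registry {
    /// Resolves `<name> version as of <version>` like any other table lookup.
    fn resolve(&self, name: &str, version: u64) -> Option<&[i64]> {
        let key = TableKey {
            name: name.to_string(),
            version,
        };
        self.snapshots.get(&key).map(|rows| rows.as_slice())
    }
}

fn main() {
    let mut snapshots = HashMap::new();
    snapshots.insert(TableKey { name: "t".to_string(), version: 1 }, vec![10]);
    snapshots.insert(TableKey { name: "t".to_string(), version: 2 }, vec![10, 20]);
    let registry = Registry { snapshots };

    // `t version as of 1` and `t version as of 2` are simply two distinct tables.
    println!("{:?}", registry.resolve("t", 1));
    println!("{:?}", registry.resolve("t", 2));
}
```

Because a snapshot never changes once registered, resolving the same key twice yields the same rows, which matches the expectation that querying the same table returns the same result.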





[GitHub] [arrow-datafusion] houqp commented on issue #533: Add extension plugin to parse SQL into logical plan

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #533:
URL: https://github.com/apache/arrow-datafusion/issues/533#issuecomment-860149933


   @adsharma I can't think of a reason why we would want to prevent the union query you mentioned. I don't think what @jorgecarleitao wrote in the issue description implies this?





[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #533: Add extension plugin to parse SQL into logical plan

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #533:
URL: https://github.com/apache/arrow-datafusion/issues/533#issuecomment-860156321


   I also do not see an issue with the example above. In general, though, custom SQL parsers effectively modify the SQL dialect in use, so the responsibility for documenting variations, including any limitations they introduce relative to the "default" postgres dialect, lies with the applications that use/install custom parsers.




