Posted to user@hive.apache.org by Peter Vary <pv...@cloudera.com> on 2020/10/22 07:49:16 UTC

Hive SQL extension

Hi Hive experts,

I would like to extend the Hive SQL language to provide a way to create Iceberg-partitioned tables like this:
create table iceberg_test(
        level string,
        event_time timestamp,
        message string,
        register_time date,
        telephone array <string>
    )
    partition by spec(
        level identity,
        event_time identity,
        event_time hour,
        register_time day
    )
    stored as iceberg;
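
For reference, this is roughly the Iceberg partition spec that the DDL above is meant to express, written against the Iceberg Java API. Treat it as a sketch only: the class name and field ids are made up, and I have not verified that the builder accepts this exact combination of transforms.

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class IcebergPartitionSpecSketch {
  public static void main(String[] args) {
    // Schema mirroring the column list of the CREATE TABLE above
    Schema schema = new Schema(
        Types.NestedField.required(1, "level", Types.StringType.get()),
        Types.NestedField.required(2, "event_time", Types.TimestampType.withZone()),
        Types.NestedField.optional(3, "message", Types.StringType.get()),
        Types.NestedField.optional(4, "register_time", Types.DateType.get()),
        Types.NestedField.optional(5, "telephone",
            Types.ListType.ofOptional(6, Types.StringType.get())));

    // One builder call per line of the PARTITION BY SPEC clause
    PartitionSpec spec = PartitionSpec.builderFor(schema)
        .identity("level")
        .identity("event_time")
        .hour("event_time")
        .day("register_time")
        .build();

    System.out.println(spec);
  }
}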

The problem is that this syntax is very specific to Iceberg, and I think it is not a good idea to change the Hive syntax globally to accommodate a specific use case.
The following CREATE TABLE statement could achieve the same thing:
create table iceberg_test(
        level string,
        event_time timestamp,
        message string,
        register_time date,
        telephone array <string>
    )
    STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
    TBLPROPERTIES ('iceberg.mr.table.partition.spec'='...');

I am looking for a way to rewrite the original query (which is not syntactically correct Hive) into a new, syntactically correct one.

I was checking the hooks as a possible solution, but I have found that:
HiveDriverRunHook.preDriverRun can get the original, syntactically incorrect query, but I have found no way to rewrite it into a syntactically correct one (the context looks read-only)
HiveSemanticAnalyzerHook can rewrite the AST, but it needs a syntactically correct query to start with (rough sketches of both hooks are below)
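
To make this concrete, here is a minimal sketch of the two hook shapes as I understand them (written from memory against the org.apache.hadoop.hive.ql hook interfaces, so double-check the exact signatures before relying on them):

import org.apache.hadoop.hive.ql.HiveDriverRunHook;
import org.apache.hadoop.hive.ql.HiveDriverRunHookContext;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook;
import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
import org.apache.hadoop.hive.ql.parse.SemanticException;

// Runs before parsing: we can read the raw command text, but there seems to be
// no supported way to hand a rewritten command back to the driver.
class InspectOnlyDriverRunHook implements HiveDriverRunHook {
  @Override
  public void preDriverRun(HiveDriverRunHookContext hookContext) throws Exception {
    String command = hookContext.getCommand();
    System.out.println("About to run: " + command);
  }

  @Override
  public void postDriverRun(HiveDriverRunHookContext hookContext) throws Exception {
    // nothing to do
  }
}

// Runs after parsing: we can return a modified ASTNode, but the AST only exists
// because the text already went through the Hive parser, so a query using the
// PARTITION BY SPEC sugar never reaches this point.
class PassThroughSemanticHook extends AbstractSemanticAnalyzerHook {
  @Override
  public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
      throws SemanticException {
    return ast;
  }
}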

Any other ideas on how to achieve the goals above, either with hooks or in any other way?

Thanks,
Peter

Re: Hive SQL extension

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
Hi Jesus, Stamatis

Thanks for taking the time and answering my questions!

Here is some info I have started from:

Iceberg partitioning is defined here:
https://iceberg.apache.org/spec/#partitioning

Other examples of SQL extension proposals from the Iceberg world:
https://docs.google.com/document/d/1Nf8c16R2hj4lSc-4sQg4oiUUV_F4XqZKth1woEo6TN8/edit#heading=h.v8gsu2fe19q2


Impala uses this syntax:
https://issues.apache.org/jira/browse/IMPALA-9688
https://issues.apache.org/jira/browse/IMPALA-10165

I agree with the Iceberg folks that it would be good to synchronize the syntax across different engines.


As for the virtual column proposal:
Would it be OK to create a syntax in Hive which does not have a working implementation in Hive?
Currently the IcebergSerDe / IcebergStorageHandler and friends are in the Iceberg repo.
Even if we move the storage handler part from the Iceberg repo to the Hive repo, Iceberg depends on HMS, since HiveCatalog stores the table snapshots in an HMS instance.
Having Hive depend on Iceberg would cause a circular dependency, which is always a problem.
Could we find similar solutions for the other proposed SQL extensions as well?

I am really torn on the topic of where and how to parse this syntactic sugar. My thoughts:
If we parse in the hooks, then the hooks themselves will be brittle and hugely dependent on the original Hive syntax.
If we add everything to the big Hive parser, then we will either bloat the Hive parser (even more) with rarely used syntax, or we restrict openness / extensibility.

I was leaning towards parsing in the hooks, since that brittleness affects only the users of the hooks; the syntax can be brought into Hive proper if we decide that it is generally useful, and if the audience remains narrow it will not affect the other Hive users. A rough sketch of what such a rewrite could look like is below.
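
Something along these lines, purely as an illustration of the idea (the class name, the regex and the way the spec text is copied into the table property are all made up here; a real hook would need a proper, if simplified, parser and would have to produce whatever format the storage handler actually expects for 'iceberg.mr.table.partition.spec'):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PartitionSpecSugarRewriter {

  // Matches the "partition by spec( ... ) stored as iceberg" sugar from the example above
  private static final Pattern SUGAR = Pattern.compile(
      "partition\\s+by\\s+spec\\s*\\((?<spec>[^)]*)\\)\\s*stored\\s+as\\s+iceberg",
      Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  public static String rewrite(String command) {
    Matcher matcher = SUGAR.matcher(command);
    if (!matcher.find()) {
      return command;  // no sugar found, hand the command back untouched
    }
    String spec = matcher.group("spec").trim();  // e.g. "level identity, event_time hour"
    String replacement =
        "STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'\n"
        + "TBLPROPERTIES ('iceberg.mr.table.partition.spec'='" + spec + "')";
    return new StringBuilder(command)
        .replace(matcher.start(), matcher.end(), replacement)
        .toString();
  }
}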

Thanks,
Peter



> On Oct 27, 2020, at 15:03, Jesus Camacho Rodriguez <jc...@apache.org> wrote:
> 
> Hi Peter,
> 
> Thanks for bringing this up.
> 
> Why are you targeting the 'partition by spec' syntax? Is it for convenience?
> Was it already introduced by Iceberg?
> 
> I did not understand the reasoning for not introducing the new syntax in
> Hive. As it was already mentioned by Stamatis, there is some advantage to
> doing syntax validation through the usual flow.
> If the question is whether we could make it useful beyond Iceberg format
> and there is no attachment to the syntax above, could this be generalized
> to introducing generated virtual column declaration in Hive (many RDBMSs
> support these) + using the current partitioning declaration? For instance,
> for your DDL declaration above:
> 
> create table iceberg_test(
>        level string,
>        event_time timestamp,
>        message string,
>        register_time date,
>        telephone array <string>
>    )
>    partitioned by (
>        v_level [GENERATED ALWAYS] AS level,
>        v_event_time [GENERATED ALWAYS] AS event_time,
>        v_event_time_hour [GENERATED ALWAYS] AS hour(event_time),
>        v_register_time [GENERATED ALWAYS] AS day(register_time)
>    )
>    stored as iceberg;
> 
> This would assume that the underlying storage format supports partitioning
> by virtual columns. I believe this syntax would allow us to take some of
> these ideas even further, e.g., introduce 'stored' derived columns or
> custom partitioning specs (although without the underlying storage format
> support, probably they would need to be 'stored' instead of 'virtual').
> 
> Even if you introduce this syntax, you could still do the transformation
> that you described above internally, i.e., storage handler resolution and
> table properties generation. Thus, the internal handling would be the same.
> 
> Thanks,
> Jesús
> 
> On Mon, Oct 26, 2020 at 2:27 AM Stamatis Zampetakis <zabetak@gmail.com>
> wrote:
> 
>> I do like extensions and things that simplify our life when
>> writing queries.
>> 
>> Regarding the partitioning syntax for Iceberg, there may be better
>> alternatives.
>> I was also leaning towards a syntax like the one proposed by Jesus (in
>> another thread) based on virtual columns, which is also part of SQL
>> standard.
>> 
>> Regarding the other use cases mentioned (temporal queries, time travel
>> etc.) there are things that are part of SQL standard so we could start from
>> there and then introduce extensions if needed.
>> 
>> Syntactic sugar is powerful but in terms of design I find it more
>> appropriate to perform "desugaring" after having an AST; either AST to AST
>> transformations or afterwards.
>> The syntax (sugared or not) is part and responsibility of the parser so an
>> architecture with sub-parser hooks seems a bit brittle, especially if we
>> start using it extensively.
>> Having said that you have thought of this much more than I did so maybe
>> the hook's approach is a better idea after all :)
>> 
>> Best,
>> Stamatis
>> 
>> On Fri, Oct 23, 2020 at 2:26 PM Pau Tallada <ta...@pic.es> wrote:
>> 
>>> Hi all,
>>> 
>>> I do not know if that may be of interest to you, but there are other
>>> projects that could benefit from this.
>>> For instance, ADQL
>>> <https://www.ivoa.net/documents/ADQL/20180112/PR-ADQL-2.1-20180112.html>
>>> (Astronomical Data Query Language) is a SQL-like language that defines some
>>> higher-level functions that enable powerful geospatial queries. Projects
>>> like queryparser <https://github.com/aipescience/queryparser> are able
>>> to translate from ADQL to vendor-SQL for MySQL or PostgreSQL. In this case,
>>> the syntactic sugar is implemented as an external layer on top, but could
>>> very well be implemented in a rewrite hook if available.
>>> 
>>> Cheers,
>>> 
>>> Pau.
>>> 
>>> Message from Peter Vary <pv...@cloudera.com> on Thu, Oct 22, 2020
>>> at 16:21:
>>> 
>>>> 
>>>> Let's assume that this feature would be useful for Iceberg tables, but
>>>> useless and even problematic/forbidden for other tables. :)
>>>> 
>>>> My thinking is, that it could make Hive much more user friendly, if we
>>>> would allow for extensions in language.
>>>> 
>>>> With Iceberg integration we plan to do several extensions which might
>>>> not be useful for other tables. Some examples:
>>>> 
>>>>   - When creating tables we want to send additional information to the
>>>>   storage layer, and pushing everything in properties is a pain (not really
>>>>   user friendly)
>>>>   - We would like to allow querying table history for iceberg tables
>>>>   (previous snapshotId-s, timestamps, etc)
>>>>   - We would like to allow time travel for iceberg tables based on the
>>>>   data queried above
>>>>   - We would like to allow the user to see / manage / remove old
>>>>   snapshots
>>>> 
>>>> 
>>>> These are all very specific Iceberg related stuff, and most probably
>>>> will not work / useful for any other type of the tables, so I think adding
>>>> them to Hive parser would be a stretch.
>>>> 
>>>> On the other hand if we do not provide SQL interface for accessing these
>>>> features then the users will turn to Spark/Impala/Presto to be able to work
>>>> with Iceberg tables.
>>>> 
>>>> As for your specific question for handling syntax errors (I have just
>>>> started to think about how would I do it, so feel free to suggest better
>>>> methods):
>>>> 
>>>>   - Let's assume that we have a hook which can get the sql command as
>>>>   an input and can rewrite it to a new SQL command
>>>>   - I would write simplified parser which tries to be as simple as
>>>>   possible for the specific command
>>>>   - Based on the parsing I would return the same command / throw an
>>>>   exception / rewrite the command
>>>> 
>>>> 
>>>> Admittedly this solution is working only if we can make every feature
>>>> work without changing other part of Hive, and we just want to add
>>>> "syntactic sugar" to it. (Do not underestimate the benefits of syntactic
>>>> sugar :))
>>>> 
>>>> Thanks,
>>>> Peter
>>>> 
>>>> 
>>>> On Oct 22, 2020, at 11:44, Stamatis Zampetakis <za...@gmail.com>
>>>> wrote:
>>>> 
>>>> Hi Peter,
>>>> 
>>>> I am nowhere near being an expert but just wanted to share my thoughts.
>>>> 
>>>> If I understand correctly you would like some syntactic sugar in Hive to
>>>> support partitioning as per Iceberg. I cannot tell if that's really useful
>>>> or not but from my point of view it doesn't seem a very good idea to
>>>> introduce another layer of parsing before the actual parser (don't know if
>>>> there is one already). For instance, how are you gonna handle the situation
>>>> where there are syntax errors in your sugared part and what the end user
>>>> should see?
>>>> 
>>>> No matter how it is added if you give the possibility to the user to
>>>> write such queries it becomes part of the Hive syntax and as such a job of
>>>> the parser.
>>>> 
>>>> Best,
>>>> Stamatis
>>>> 
>>>> 
>>>> On Thu, Oct 22, 2020 at 9:49 AM Peter Vary <pv...@cloudera.com> wrote:
>>>> 
>>>>> Hi Hive experts,
>>>>> 
>>>>> I would like to extend Hive SQL language to provide a way to create
>>>>> Iceberg partitioned tables like this:
>>>>> 
>>>>> create table iceberg_test(
>>>>>        level string,
>>>>>        event_time timestamp,
>>>>>        message string,
>>>>>        register_time date,
>>>>>        telephone array <string>
>>>>>    )
>>>>>    partition by spec(
>>>>>        level identity,
>>>>>        event_time identity,
>>>>>        event_time hour,
>>>>>        register_time day
>>>>>    )
>>>>>    stored as iceberg;
>>>>> 
>>>>> 
>>>>> The problem is that this syntax is very specific to Iceberg, and I
>>>>> think it is not a good idea to change the Hive syntax globally to
>>>>> accommodate a specific use-case.
>>>>> The following CREATE TABLE statement could achieve the same thing:
>>>>> 
>>>>> create table iceberg_test(
>>>>>        level string,
>>>>>        event_time timestamp,
>>>>>        message string,
>>>>>        register_time date,
>>>>>        telephone array <string>
>>>>>    )
>>>>>    STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
>>>>>    TBLPROPERTIES ('iceberg.mr.table.partition.spec'='...');
>>>>> 
>>>>> 
>>>>> I am looking for a way to rewrite the original (Hive syntactically not
>>>>> correct) query to a new (syntactically correct) one.
>>>>> 
>>>>> I was checking the hooks as a possible solution, but I have found that:
>>>>> 
>>>>>   - HiveDriverRunHook.preDriverRun can get the original /
>>>>>   syntactically not correct query, but I have found no way to rewrite it to a
>>>>>   syntactically correct one (it looks like a read only query)
>>>>>   - HiveSemanticAnalyzerHook can rewrite the AST tree, but it needs a
>>>>>   syntactically correct query to start with
>>>>> 
>>>>> 
>>>>> Any other ideas how to achieve the goals above? Either with Hooks, or
>>>>> with any other way?
>>>>> 
>>>>> Thanks,
>>>>> Peter
>>>>> 
>>>> 
>>>> 
>>> 
>>> --
>>> ----------------------------------
>>> Pau Tallada Crespí
>>> Departament de Serveis
>>> Port d'Informació Científica (PIC)
>>> Tel: +34 93 170 2729
>>> ----------------------------------


Re: Hive SQL extension

Posted by Jesus Camacho Rodriguez <jc...@apache.org>.
Hi Peter,

Thanks for bringing this up.

Why are you targeting the 'partition by spec' syntax? Is it for convenience?
Was it already introduced by Iceberg?

I did not understand the reasoning for not introducing the new syntax in
Hive. As it was already mentioned by Stamatis, there is some advantage to
doing syntax validation through the usual flow.
If the question is whether we could make it useful beyond Iceberg format
and there is no attachment to the syntax above, could this be generalized
to introducing generated virtual column declaration in Hive (many RDBMSs
support these) + using the current partitioning declaration? For instance,
for your DDL declaration above:

create table iceberg_test(
        level string,
        event_time timestamp,
        message string,
        register_time date,
        telephone array <string>
    )
    partitioned by (
        v_level [GENERATED ALWAYS] AS level,
        v_event_time [GENERATED ALWAYS] AS event_time,
        v_event_time_hour [GENERATED ALWAYS] AS hour(event_time),
        v_register_time [GENERATED ALWAYS] AS day(register_time)
    )
    stored as iceberg;

This would assume that the underlying storage format supports partitioning
by virtual columns. I believe this syntax would allow us to take some of
these ideas even further, e.g., introduce 'stored' derived columns or
custom partitioning specs (although without the underlying storage format
support, probably they would need to be 'stored' instead of 'virtual').

Even if you introduce this syntax, you could still do the transformation
that you described above internally, i.e., storage handler resolution and
table properties generation. Thus, the internal handling would be the same.

Thanks,
Jesús

On Mon, Oct 26, 2020 at 2:27 AM Stamatis Zampetakis <za...@gmail.com>
wrote:

> I do like extensions and things that simplify our life when
> writing queries.
>
> Regarding the partitioning syntax for Iceberg, there may be better
> alternatives.
> I was also leaning towards a syntax like the one proposed by Jesus (in
> another thread) based on virtual columns, which is also part of SQL
> standard.
>
> Regarding the other use cases mentioned (temporal queries, time travel
> etc.) there are things that are part of SQL standard so we could start from
> there and then introduce extensions if needed.
>
> Syntactic sugar is powerful but in terms of design I find it more
> appropriate to perform "desugaring" after having an AST; either AST to AST
> transformations or afterwards.
> The syntax (sugared or not) is part and responsibility of the parser so an
> architecture with sub-parser hooks seems a bit brittle, especially if we
> start using it extensively.
> Having said that you have thought of this much more than I did so maybe
> the hook's approach is a better idea after all :)
>
> Best,
> Stamatis
>
> On Fri, Oct 23, 2020 at 2:26 PM Pau Tallada <ta...@pic.es> wrote:
>
>> Hi all,
>>
>> I do not know if that may be of interest to you, but there are other
>> projects that could benefit from this.
>> For instance, ADQL
>> <https://www.ivoa.net/documents/ADQL/20180112/PR-ADQL-2.1-20180112.html>
>> (Astronomical Data Query Language) is a SQL-like language that defines some
>> higher-level functions that enable powerful geospatial queries. Projects
>> like queryparser <https://github.com/aipescience/queryparser> are able
>> to translate from ADQL to vendor-SQL for MySQL or PostgreSQL. In this case,
>> the syntactic sugar is implemented as an external layer on top, but could
>> very well be implemented in a rewrite hook if available.
>>
>> Cheers,
>>
>> Pau.
>>
>> Message from Peter Vary <pv...@cloudera.com> on Thu, Oct 22, 2020
>> at 16:21:
>>
>>>
>>> Let's assume that this feature would be useful for Iceberg tables, but
>>> useless and even problematic/forbidden for other tables. :)
>>>
>>> My thinking is, that it could make Hive much more user friendly, if we
>>> would allow for extensions in language.
>>>
>>> With Iceberg integration we plan to do several extensions which might
>>> not be useful for other tables. Some examples:
>>>
>>>    - When creating tables we want to send additional information to the
>>>    storage layer, and pushing everything in properties is a pain (not really
>>>    user friendly)
>>>    - We would like to allow querying table history for iceberg tables
>>>    (previous snapshotId-s, timestamps, etc)
>>>    - We would like to allow time travel for iceberg tables based on the
>>>    data queried above
>>>    - We would like to allow the user to see / manage / remove old
>>>    snapshots
>>>
>>>
>>> These are all very specific Iceberg related stuff, and most probably
>>> will not work / useful for any other type of the tables, so I think adding
>>> them to Hive parser would be a stretch.
>>>
>>> On the other hand if we do not provide SQL interface for accessing these
>>> features then the users will turn to Spark/Impala/Presto to be able to work
>>> with Iceberg tables.
>>>
>>> As for your specific question for handling syntax errors (I have just
>>> started to think about how would I do it, so feel free to suggest better
>>> methods):
>>>
>>>    - Let's assume that we have a hook which can get the sql command as
>>>    an input and can rewrite it to a new SQL command
>>>    - I would write simplified parser which tries to be as simple as
>>>    possible for the specific command
>>>    - Based on the parsing I would return the same command / throw an
>>>    exception / rewrite the command
>>>
>>>
>>> Admittedly this solution is working only if we can make every feature
>>> work without changing other part of Hive, and we just want to add
>>> "syntactic sugar" to it. (Do not underestimate the benefits of syntactic
>>> sugar :))
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>>> On Oct 22, 2020, at 11:44, Stamatis Zampetakis <za...@gmail.com>
>>> wrote:
>>>
>>> Hi Peter,
>>>
>>> I am nowhere near being an expert but just wanted to share my thoughts.
>>>
>>> If I understand correctly you would like some syntactic sugar in Hive to
>>> support partitioning as per Iceberg. I cannot tell if that's really useful
>>> or not but from my point of view it doesn't seem a very good idea to
>>> introduce another layer of parsing before the actual parser (don't know if
>>> there is one already). For instance, how are you gonna handle the situation
>>> where there are syntax errors in your sugared part and what the end user
>>> should see?
>>>
>>> No matter how it is added if you give the possibility to the user to
>>> write such queries it becomes part of the Hive syntax and as such a job of
>>> the parser.
>>>
>>> Best,
>>> Stamatis
>>>
>>>
>>> On Thu, Oct 22, 2020 at 9:49 AM Peter Vary <pv...@cloudera.com> wrote:
>>>
>>>> Hi Hive experts,
>>>>
>>>> I would like to extend Hive SQL language to provide a way to create
>>>> Iceberg partitioned tables like this:
>>>>
>>>> create table iceberg_test(
>>>>         level string,
>>>>         event_time timestamp,
>>>>         message string,
>>>>         register_time date,
>>>>         telephone array <string>
>>>>     )
>>>>     partition by spec(
>>>>         level identity,
>>>>         event_time identity,
>>>>         event_time hour,
>>>>         register_time day
>>>>     )
>>>>     stored as iceberg;
>>>>
>>>>
>>>> The problem is that this syntax is very specific to Iceberg, and I
>>>> think it is not a good idea to change the Hive syntax globally to
>>>> accommodate a specific use-case.
>>>> The following CREATE TABLE statement could achieve the same thing:
>>>>
>>>> create table iceberg_test(
>>>>         level string,
>>>>         event_time timestamp,
>>>>         message string,
>>>>         register_time date,
>>>>         telephone array <string>
>>>>     )
>>>>     STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
>>>>     TBLPROPERTIES ('iceberg.mr.table.partition.spec'='...');
>>>>
>>>>
>>>> I am looking for a way to rewrite the original (Hive syntactically not
>>>> correct) query to a new (syntactically correct) one.
>>>>
>>>> I was checking the hooks as a possible solution, but I have found that:
>>>>
>>>>    - HiveDriverRunHook.preDriverRun can get the original /
>>>>    syntactically not correct query, but I have found no way to rewrite it to a
>>>>    syntactically correct one (it looks like a read only query)
>>>>    - HiveSemanticAnalyzerHook can rewrite the AST tree, but it needs a
>>>>    syntactically correct query to start with
>>>>
>>>>
>>>> Any other ideas how to achieve the goals above? Either with Hooks, or
>>>> with any other way?
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>
>>>
>>
>> --
>> ----------------------------------
>> Pau Tallada Crespí
>> Departament de Serveis
>> Port d'Informació Científica (PIC)
>> Tel: +34 93 170 2729
>> ----------------------------------
>>
>>

Re: Hive SQL extension

Posted by Stamatis Zampetakis <za...@gmail.com>.
I do like extensions and things that simplify our life when
writing queries.

Regarding the partitioning syntax for Iceberg, there may be better
alternatives.
I was also leaning towards a syntax like the one proposed by Jesus (in
another thread) based on virtual columns, which is also part of the SQL
standard.

Regarding the other use cases mentioned (temporal queries, time travel,
etc.), there are things that are part of the SQL standard, so we could start
from there and then introduce extensions if needed.

Syntactic sugar is powerful, but in terms of design I find it more
appropriate to perform "desugaring" after having an AST; either AST-to-AST
transformations or afterwards.
The syntax (sugared or not) is part of, and the responsibility of, the parser,
so an architecture with sub-parser hooks seems a bit brittle, especially if we
start using it extensively.
Having said that, you have thought about this much more than I have, so maybe
the hooks approach is a better idea after all :)

Best,
Stamatis

On Fri, Oct 23, 2020 at 2:26 PM Pau Tallada <ta...@pic.es> wrote:

> Hi all,
>
> I do not know if that may be of interest to you, but there are other
> projects that could benefit from this.
> For instance, ADQL
> <https://www.ivoa.net/documents/ADQL/20180112/PR-ADQL-2.1-20180112.html>
> (Astronomical Data Query Language) is a SQL-like language that defines some
> higher-level functions that enable powerful geospatial queries. Projects
> like queryparser <https://github.com/aipescience/queryparser> are able to
> translate from ADQL to vendor-SQL for MySQL or PostgreSQL. In this case, the
> syntactic sugar is implemented as an external layer on top, but could very
> well be implemented in a rewrite hook if available.
>
> Cheers,
>
> Pau.
>
> Message from Peter Vary <pv...@cloudera.com> on Thu, Oct 22, 2020 at
> 16:21:
>
>>
>> Let's assume that this feature would be useful for Iceberg tables, but
>> useless and even problematic/forbidden for other tables. :)
>>
>> My thinking is, that it could make Hive much more user friendly, if we
>> would allow for extensions in language.
>>
>> With Iceberg integration we plan to do several extensions which might not
>> be useful for other tables. Some examples:
>>
>>    - When creating tables we want to send additional information to the
>>    storage layer, and pushing everything in properties is a pain (not really
>>    user friendly)
>>    - We would like to allow querying table history for iceberg tables
>>    (previous snapshotId-s, timestamps, etc)
>>    - We would like to allow time travel for iceberg tables based on the
>>    data queried above
>>    - We would like to allow the user to see / manage / remove old
>>    snapshots
>>
>>
>> These are all very specific Iceberg related stuff, and most probably will
>> not work / useful for any other type of the tables, so I think adding them
>> to Hive parser would be a stretch.
>>
>> On the other hand if we do not provide SQL interface for accessing these
>> features then the users will turn to Spark/Impala/Presto to be able to work
>> with Iceberg tables.
>>
>> As for your specific question for handling syntax errors (I have just
>> started to think about how would I do it, so feel free to suggest better
>> methods):
>>
>>    - Let's assume that we have a hook which can get the sql command as
>>    an input and can rewrite it to a new SQL command
>>    - I would write simplified parser which tries to be as simple as
>>    possible for the specific command
>>    - Based on the parsing I would return the same command / throw an
>>    exception / rewrite the command
>>
>>
>> Admittedly this solution is working only if we can make every feature
>> work without changing other part of Hive, and we just want to add
>> "syntactic sugar" to it. (Do not underestimate the benefits of syntactic
>> sugar :))
>>
>> Thanks,
>> Peter
>>
>>
>> On Oct 22, 2020, at 11:44, Stamatis Zampetakis <za...@gmail.com> wrote:
>>
>> Hi Peter,
>>
>> I am nowhere near being an expert but just wanted to share my thoughts.
>>
>> If I understand correctly you would like some syntactic sugar in Hive to
>> support partitioning as per Iceberg. I cannot tell if that's really useful
>> or not but from my point of view it doesn't seem a very good idea to
>> introduce another layer of parsing before the actual parser (don't know if
>> there is one already). For instance, how are you gonna handle the situation
>> where there are syntax errors in your sugared part and what the end user
>> should see?
>>
>> No matter how it is added if you give the possibility to the user to
>> write such queries it becomes part of the Hive syntax and as such a job of
>> the parser.
>>
>> Best,
>> Stamatis
>>
>>
>> On Thu, Oct 22, 2020 at 9:49 AM Peter Vary <pv...@cloudera.com> wrote:
>>
>>> Hi Hive experts,
>>>
>>> I would like to extend Hive SQL language to provide a way to create
>>> Iceberg partitioned tables like this:
>>>
>>> create table iceberg_test(
>>>         level string,
>>>         event_time timestamp,
>>>         message string,
>>>         register_time date,
>>>         telephone array <string>
>>>     )
>>>     partition by spec(
>>>         level identity,
>>>         event_time identity,
>>>         event_time hour,
>>>         register_time day
>>>     )
>>>     stored as iceberg;
>>>
>>>
>>> The problem is that this syntax is very specific to Iceberg, and I think
>>> it is not a good idea to change the Hive syntax globally to accommodate a
>>> specific use-case.
>>> The following CREATE TABLE statement could achieve the same thing:
>>>
>>> create table iceberg_test(
>>>         level string,
>>>         event_time timestamp,
>>>         message string,
>>>         register_time date,
>>>         telephone array <string>
>>>     )
>>>     STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
>>>     TBLPROPERTIES ('iceberg.mr.table.partition.spec'='...');
>>>
>>>
>>> I am looking for a way to rewrite the original (Hive syntactically not
>>> correct) query to a new (syntactically correct) one.
>>>
>>> I was checking the hooks as a possible solution, but I have found that:
>>>
>>>    - HiveDriverRunHook.preDriverRun can get the original /
>>>    syntactically not correct query, but I have found no way to rewrite it to a
>>>    syntactically correct one (it looks like a read only query)
>>>    - HiveSemanticAnalyzerHook can rewrite the AST tree, but it needs a
>>>    syntactically correct query to start with
>>>
>>>
>>> Any other ideas how to achieve the goals above? Either with Hooks, or
>>> with any other way?
>>>
>>> Thanks,
>>> Peter
>>>
>>
>>
>
> --
> ----------------------------------
> Pau Tallada Crespí
> Departament de Serveis
> Port d'Informació Científica (PIC)
> Tel: +34 93 170 2729
> ----------------------------------
>
>

Re: Hive SQL extension

Posted by Pau Tallada <ta...@pic.es>.
Hi all,

I do not know if that may be of interest to you, but there are other
projects that could benefit from this.
For instance, ADQL
<https://www.ivoa.net/documents/ADQL/20180112/PR-ADQL-2.1-20180112.html>
(Astronomical Data Query Language) is a SQL-like language that defines some
higher-level functions that enable powerful geospatial queries. Projects
like queryparser <https://github.com/aipescience/queryparser> are able to
translate from ADQL to vendor-SQL for MySQL or PostgreSQL. In this case, the
syntactic sugar is implemented as an external layer on top, but could very
well be implemented in a rewrite hook if available.
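
As a very rough sketch of that "external layer on top" idea (all class and
method names below are made up, and the actual translation is left as a
hypothetical callback), a client could rewrite the sugared statement into
plain HiveQL before submitting it to HiveServer2 over JDBC:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of the "external layer on top" approach: the dialect-specific
// sugar is translated to plain HiveQL on the client side, before the
// statement ever reaches HiveServer2. All names are illustrative only.
public class SugaredHiveClient {

    // Hypothetical translator; a real one would need a small parser.
    public interface SugarTranslator {
        String translate(String sugaredSql);
    }

    private final String jdbcUrl;
    private final SugarTranslator translator;

    public SugaredHiveClient(String jdbcUrl, SugarTranslator translator) {
        this.jdbcUrl = jdbcUrl;
        this.translator = translator;
    }

    // Translates the sugared statement and runs the resulting HiveQL.
    public void execute(String sugaredSql) throws Exception {
        String hiveSql = translator.translate(sugaredSql);
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement()) {
            stmt.execute(hiveSql);
        }
    }
}

The obvious downside compared to a server-side hook is that every client
(Beeline, JDBC applications, BI tools, etc.) would have to go through this
extra layer.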

Cheers,

Pau.

-- 
----------------------------------
Pau Tallada Crespí
Departament de Serveis
Port d'Informació Científica (PIC)
Tel: +34 93 170 2729
----------------------------------

Re: Hive SQL extension

Posted by Peter Vary <pv...@cloudera.com>.
Let's assume that this feature would be useful for Iceberg tables, but useless and even problematic/forbidden for other tables. :)

My thinking is that it could make Hive much more user friendly if we allowed for extensions in the language.

With the Iceberg integration we plan several extensions which might not be useful for other tables. Some examples:
   - When creating tables we want to send additional information to the storage layer, and pushing everything into table properties is a pain (not really user friendly)
   - We would like to allow querying table history for Iceberg tables (previous snapshotIds, timestamps, etc.)
   - We would like to allow time travel for Iceberg tables based on the data queried above
   - We would like to allow the user to see / manage / remove old snapshots

These are all very Iceberg-specific features, and most probably will not work / be useful for any other type of table, so I think adding them to the Hive parser would be a stretch.

On the other hand, if we do not provide a SQL interface for accessing these features, then users will turn to Spark/Impala/Presto to be able to work with Iceberg tables.

As for your specific question about handling syntax errors (I have just started to think about how I would do it, so feel free to suggest better methods; a rough sketch follows below):
   - Let's assume that we have a hook which can get the SQL command as an input and can rewrite it to a new SQL command
   - I would write a simplified parser which tries to be as simple as possible for the specific command
   - Based on the parsing I would return the same command / throw an exception / rewrite the command
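
To make this concrete, here is a very rough sketch of what such a hook could look like. Everything in it is hypothetical: the rewrite(String) entry point does not exist in Hive today (that is exactly the missing piece), the partition-spec handling is only regex-deep, and the value written into 'iceberg.mr.table.partition.spec' uses a made-up format rather than whatever the storage handler really expects:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: assumes a (currently non-existent) hook entry point that
// receives the raw command text and may return a rewritten command.
public class IcebergCreateTableRewriteHook {

    // Matches "partition by spec( ... )" followed later by "stored as iceberg".
    // Transforms containing parentheses (e.g. bucket) would already defeat
    // this regex; a real implementation needs a small dedicated parser.
    private static final Pattern SUGAR = Pattern.compile(
        "partition\\s+by\\s+spec\\s*\\((?<spec>[^)]*)\\)(?<tail>.*?)stored\\s+as\\s+iceberg",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the command unchanged, rewrites it, or throws on a malformed spec.
    public String rewrite(String command) {
        Matcher matcher = SUGAR.matcher(command);
        if (!matcher.find()) {
            return command;                      // not our sugar, leave it alone
        }
        String spec = matcher.group("spec").trim();
        if (spec.isEmpty()) {
            throw new IllegalArgumentException("Empty Iceberg partition spec");
        }
        // Validate the "column transform" pairs and build a property value.
        // The column:transform;... format below is made up for the sketch.
        StringBuilder specValue = new StringBuilder();
        for (String field : spec.split(",")) {
            String[] tokens = field.trim().split("\\s+");
            if (tokens.length != 2) {
                throw new IllegalArgumentException("Malformed partition field: " + field);
            }
            if (specValue.length() > 0) {
                specValue.append(';');
            }
            specValue.append(tokens[0]).append(':').append(tokens[1]);
        }
        // Swap the sugared clause for the storage-handler based syntax,
        // keeping whatever stood between the spec and "stored as iceberg".
        String replacement = matcher.group("tail")
            + " STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'"
            + " TBLPROPERTIES ('iceberg.mr.table.partition.spec'='" + specValue + "')";
        return new StringBuilder(command)
            .replace(matcher.start(), matcher.end(), replacement)
            .toString();
    }
}

With something like this, a well-formed statement is rewritten into the STORED BY / TBLPROPERTIES form before it reaches the real parser, statements that do not use the sugar pass through untouched, and a malformed spec fails early with an error the user can act on. In a real implementation the regex would of course give way to the small dedicated parser mentioned above.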

Admittedly, this solution works only if we can make every feature work without changing other parts of Hive, and we just want to add "syntactic sugar" on top of it. (Do not underestimate the benefits of syntactic sugar :))

Thanks,
Peter


Re: Hive SQL extension

Posted by Stamatis Zampetakis <za...@gmail.com>.
Hi Peter,

I am nowhere near being an expert but just wanted to share my thoughts.

If I understand correctly, you would like some syntactic sugar in Hive to
support partitioning as per Iceberg. I cannot tell whether that is really
useful or not, but from my point of view it does not seem a very good idea
to introduce another layer of parsing before the actual parser (I don't
know if there is one already). For instance, how are you going to handle
the situation where there are syntax errors in your sugared part, and what
should the end user see?

No matter how it is added, if you give users the possibility to write such
queries, it becomes part of the Hive syntax and, as such, a job for the
parser.

Best,
Stamatis

