Posted to user@hive.apache.org by Elliot West <te...@gmail.com> on 2018/02/16 17:49:04 UTC

HQL parser internals

Hello,

We need to be able to parse and manipulate an HQL query. To date we’ve been
intercepting and transforming the parse tree in the SemanticAnalyzerHook.
However, ideally we wish to manipulate the original query as delivered by
the user (or as close to it as possible), and we’re finding that the tree
has been modified significantly by the time it hits the hook. Additionally
we wish to track back ASTNodes to the character sequences in the source HQL
that were their origin (where sensible), and ultimately hope to be able to
regenerate the query text from the AST.

We are expecting to need to write our own parser to do this, but hope to
use the Hive grammar at least as the foundation. I wonder if anyone
familiar with this code might be able to suggest some good starting points.

The use case, if you are interested, is a mutation testing framework for
HQL. The testing of mutants is operational, but now we need to report on
survivors, hence the need to track back from specific query elements to
character sequences in the original query string.

Thanks,

Elliot.

Re: HQL parser internals

Posted by Furcy Pin <pi...@gmail.com>.
Hi Elliot,

Yes, the variable substitution is done before the parsing. This makes
generic query validation much more complicated.

As I explained in my previous message, what I did was replace these
${VARS} with strings "${VARS}" that the HiveParser would agree to parse,
and that I could recognize afterwards...



On 19 March 2018 at 22:33, Elliot West <te...@gmail.com> wrote:

> Hello again,
>
> We're now testing our system against a corpus of Hive SQL statements in an
> effort to quickly highlight edge cases, limitations etc. We're finding that
> org.apache.hadoop.hive.ql.parse.ParseUtils is stumbling on variables such
> as ${hiveconf:varname}. Are variable substitutions handled prior to
> parsing or within the parser itself? If in a pre-processing stage, is there
> any code or utility classes within Hive that we can use as a reference, or
> to provide this functionality?
>
> Cheers,
>
> Elliot.
>
> On 19 February 2018 at 11:10, Elliot West <te...@gmail.com> wrote:
>
>> Thank you all for your rapid responses; some really useful information
>> and pointers in there.
>>
>> We'll keep the list updated with our progress.
>>
>> On 18 February 2018 at 19:00, Dharmesh Kakadia <dh...@gmail.com>
>> wrote:
>>
>>> +1 for using ParseDriver for this. I also have used it to intercept and
>>> augment query AST.
>>>
>>> Also, I would echo others' sentiment that it's quite ugly. It would be
>>> great if we could refactor/standardize this. That would make integrating
>>> other systems a lot easier.
>>>
>>> Thanks,
>>> Dharmesh
>>>
>>> On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin <pi...@gmail.com> wrote:
>>>
>>>> Hi Elliot,
>>>>
>>>> Actually, I have done quite similar work regarding Hive custom Parsing,
>>>> you should have a look at my project: https://github.com/flaminem/flamy
>>>>
>>>> The Hive parsing related stuff is here: https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
>>>> A good starting point to see how to parse queries is here:
>>>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492
>>>>
>>>>
>>>> Basically, all you need is to use an org.apache.hadoop.hive.ql.parse.ParseDriver.
>>>>
>>>> val pd: ParseDriver = new ParseDriver
>>>> val tree: ASTNode = pd.parse(query, hiveContext)
>>>>
>>>> You then get the ASTNode, which you can freely traverse and change.
>>>> Also, I must say that it is quite ugly to manipulate; the Presto
>>>> parser seems to be much better designed (though it handles a different
>>>> syntax, unfortunately), so I recommend looking at it for design ideas.
>>>>
>>>>
>>>> If you want to enrich your Hive syntax like I did (I wanted to be able
>>>> to parse ${VARS} in queries),
>>>> you will not be able to use the HiveParser without some workaround.
>>>> What I did was replace these ${VARS} with strings "${VARS}" that the
>>>> HiveParser would agree to parse,
>>>> and that I could recognize afterwards...
>>>>
>>>> Also, if you are familiar with Scala, I recommend using it, it helps a
>>>> lot...
>>>>
>>>> For instance, I have this class that transforms an AST back into a
>>>> string query:
>>>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala
>>>> I could never have done something that good looking in Java...
>>>>
>>>> Finally this method helps a lot to understand how the hell the AST
>>>> works:
>>>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593
>>>>
>>>> Make sure to write *tons* of unit tests too, you'll need them.
>>>>
>>>> Hope this helps,
>>>>
>>>> Furcy
>>>>
>>>>
>>>>
>>>> On 16 February 2018 at 21:20, Gopal Vijayaraghavan <go...@apache.org>
>>>> wrote:
>>>>
>>>>>
>>>>> > However, ideally we wish to manipulate the original query as
>>>>> delivered by the user (or as close to it as possible), and we’re finding
>>>>> that the tree has been modified significantly by the time it hits the hook
>>>>>
>>>>> That's CBO. It takes the Query -> AST -> Calcite Tree -> AST -> hook -
>>>>> the bushy join conversion is already done by the time the hook gets
>>>>> called.
>>>>>
>>>>> We need a Parser hook to hook it ahead of CBO, not a Semantic Analyzer
>>>>> hook.
>>>>>
>>>>> > Additionally we wish to track back ASTNodes to the character
>>>>> sequences in the source HQL that were their origin (where sensible), and
>>>>> ultimately hope to be able regenerate the query text from the AST.
>>>>>
>>>>> I started work on a Hive-unparser a while back based on this class,
>>>>> but it's a world of verbose coding.
>>>>>
>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java#L850
>>>>>
>>>>> If you're doing active work on this, I'd like to help, because I need
>>>>> the AST -> query to debug CBO.
>>>>>
>>>>> > The use case, if you are interested, is a mutation testing framework
>>>>> for HQL. The testing of mutants is operational, but now we need to report
>>>>> on survivors, hence the need to track back from specific query elements to
>>>>> character sequences in the original query string.
>>>>>
>>>>> This sounds a lot like the fuzzing random-query-gen used in Cloudera
>>>>> to have Impala vs Hive bug-for-bug compat.
>>>>>
>>>>> https://cwiki.apache.org/confluence/download/attachments/27362054/Random%20Query%20Gen-%20Hive%20Meetup.pptx
>>>>>
>>>>> Cheers,
>>>>> Gopal
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: HQL parser internals

Posted by Pengcheng Xiong <px...@apache.org>.
Hi Jay,

     Yes, the Hive parser is stripping out the "as". As Gopal explained in
the earlier thread, the parser is used to translate a SQL string into an AST
that the compiler can understand. That said, as long as the following
compiler knows that "oh, this is a renaming", it is OK. In other words, it
is fine to add the "as" back, as long as your compiler understands it. In
Hive's case, IMHO, I do not think it is necessary to add it back because
the compiler already knows what it means.

Best
Pengcheng


Re: HQL parser internals

Posted by Furcy Pin <pi...@gmail.com>.
Hi Jay,

I noticed the same thing when I did my tool, and it makes sense:
syntactically they are both equivalent, so the Hive parser does not care.
You could probably update the HiveParser so that it keeps the information
in the AST, but not without breaking every part of the code that reads that
AST, I'm afraid.

In the long run, I believe that it would be really valuable if someone had
the courage to completely rewrite Hive parsing using best practices
and good tools (e.g. ANTLR 4), but I doubt that it will ever happen
unfortunately :-(





Re: HQL parser internals

Posted by Jay Green-Stevens <t-...@hotels.com>.
Afternoon all,

We have successfully managed to build a Java tool which will translate a Hive query into a syntax tree, and then turn this back into a Hive query equivalent to the input.

But we have found that for a query such as
select a
from (
select a
from b
) as c

the Hive parser is stripping out the ‘as’ when building the tree. This then means that when the query string is rebuilt the output is ‘select a from (select a from b) c’, and although this is technically valid it is not equivalent to the input query.

Is there any way we can fix this issue?
Could we change the parser in some way to stop the ‘as’ from being stripped out?

Any ideas would be greatly appreciated!

Thanks,

Jay
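Until the parser preserves the keyword, one workaround for round-trip testing is to compare queries modulo the optional 'as'. A rough sketch (a hypothetical helper, not part of any tool mentioned here; it only covers subquery aliases and assumes these patterns do not occur inside string literals):

```java
public class AliasNormalizer {
    // Normalize whitespace and drop the optional 'as' before a subquery
    // alias, so "(...) as c" and "(...) c" compare equal.
    static String normalize(String hql) {
        return hql.trim()
                  .replaceAll("\\s+", " ")      // collapse runs of whitespace
                  .replaceAll("\\(\\s+", "(")   // no space after '('
                  .replaceAll("\\s+\\)", ")")   // no space before ')'
                  .replaceAll("(?i)\\)\\s+as\\s+", ") "); // drop ') as c' -> ') c'
    }
}
```

With this, a round-trip test can assert `normalize(input).equals(normalize(rebuilt))` instead of exact string equality.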





Re: HQL parser internals

Posted by Elliot West <te...@gmail.com>.
Hello again,

We're now testing our system against a corpus of Hive SQL statements in an
effort to quickly highlight edge cases, limitations etc. We're finding that
org.apache.hadoop.hive.ql.parse.ParseUtils is stumbling on variables such
as ${hiveconf:varname}. Are variable substitutions handled prior to parsing
or within the parser itself? If in a pre-processing stage, is there any code
or utility classes within Hive that we can use as a reference, or to
provide this functionality?

Cheers,

Elliot.
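For what it's worth, the substitution itself is simple enough to reproduce outside Hive. Below is a minimal sketch of such a pre-processing pass; it is illustrative only, not Hive's actual substitution code, and the class name is made up:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HiveconfSubstituter {
    // Matches ${hiveconf:name} references in the query text.
    private static final Pattern VAR =
            Pattern.compile("\\$\\{hiveconf:([^}]+)\\}");

    // Replace each ${hiveconf:name} with its value from the supplied map;
    // unknown variables are left untouched so any parse error stays visible.
    static String substitute(String hql, Map<String, String> conf) {
        Matcher m = VAR.matcher(hql);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = conf.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Running such a pass before handing the string to ParseUtils should get past the ${hiveconf:varname} failures described above.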


Re: HQL parser internals

Posted by Elliot West <te...@gmail.com>.
Thank you all for your rapid responses; some really useful information and
pointers in there.

We'll keep the list updated with our progress.

On 18 February 2018 at 19:00, Dharmesh Kakadia <dh...@gmail.com> wrote:

> +1 for using ParseDriver for this. I also have used it to intercept and
> augment query AST.
>
> Also, I would echo others' sentiment that it's quite ugly. It would be great
> if we can refactor/standardize this. That will make integrating other
> systems a lot easier.
>
> Thanks,
> Dharmesh
>
> On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin <pi...@gmail.com> wrote:
>
>> Hi Elliot,
>>
>> Actually, I have done quite similar work regarding Hive custom Parsing,
>> you should have a look at my project: https://github.com/flaminem/flamy
>>
>> The Hive parsing related stuff is here: https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
>> A good starting point to see how to parse queries is here:
>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492
>>
>>
>> Basically, all you need is to use a org.apache.hadoop.hive.ql.parse.ParseDriver.
>>
>> val pd: ParseDriver = new ParseDriver
>> val tree: ASTNode = pd.parse(query, hiveContext)
>>
>> You then get the ASTNode, that you can freely parse and change.
>> Also, I must say that it is quite ugly to manipulate, and the Presto
>> Parser seems to be much better designed (but it is not the same syntax,
>> unfortunately),
>> I recommend to look at it to get better design ideas.
>>
>>
>> If you want to enrich your Hive syntax like I did (I wanted to be able to
>> parse ${VARS} in queries),
>> you will not be able to use the HiveParser without some workaround.
>> What I did was replacing these ${VARS} by strings "${VARS}" that the
>> HiveParser would agree to parse,
>> and that I could recognize afterwards...
>>
>> Also, if you are familiar with Scala, I recommend using it, it helps a
>> lot...
>>
>> For instance, I have this class that transforms an AST back into a string
>> query:
>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala
>> I could never have done something that good looking in Java...
>>
>> Finally this method helps a lot to understand how the hell the AST works:
>> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593
>>
>> Make sure to write *tons* of unit tests too, you'll need them.
>>
>> Hope this helps,
>>
>> Furcy
>>
>>
>>
>> On 16 February 2018 at 21:20, Gopal Vijayaraghavan <go...@apache.org>
>> wrote:
>>
>>>
>>> > However, ideally we wish to manipulate the original query as delivered
>>> by the user (or as close to it as possible), and we’re finding that the
>>> tree has been modified significantly by the time it hits the hook
>>>
>>> That's CBO. It takes the Query -> AST -> Calcite Tree -> AST -> hook -
>>> the bushy join conversion is already done by the time the hook gets called.
>>>
>>> We need a Parser hook to hook it ahead of CBO, not a Semantic Analyzer
>>> hook.
>>>
>>> > Additionally we wish to track back ASTNodes to the character sequences
>>> in the source HQL that were their origin (where sensible), and ultimately
>>> hope to be able to regenerate the query text from the AST.
>>>
>>> I started work on a Hive-unparser a while back based on this class, but
>>> it's a world of verbose coding.
>>>
>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java#L850
>>>
>>> If you're doing active work on this, I'd like to help, because I need
>>> the AST -> query to debug CBO.
>>>
>>> > The use case, if you are interested, is a mutation testing framework
>>> for HQL. The testing of mutants is operational, but now we need to report
>>> on survivors, hence the need to track back from specific query elements to
>>> character sequences in the original query string.
>>>
>>> This sounds a lot like the fuzzing random-query-gen used in Cloudera to
>>> have Impala vs Hive bug-for-bug compat.
>>>
>>> https://cwiki.apache.org/confluence/download/attachments/27362054/Random%20Query%20Gen-%20Hive%20Meetup.pptx
>>>
>>> Cheers,
>>> Gopal
>>>
>>>
>>>
>>
>

Re: HQL parser internals

Posted by Dharmesh Kakadia <dh...@gmail.com>.
+1 for using ParseDriver for this. I also have used it to intercept and
augment query AST.

Also, I would echo others' sentiment that it's quite ugly. It would be great
if we can refactor/standardize this. That will make integrating other
systems a lot easier.

Thanks,
Dharmesh

On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin <pi...@gmail.com> wrote:

> Hi Elliot,
>
> Actually, I have done quite similar work regarding Hive custom Parsing,
> you should have a look at my project: https://github.com/flaminem/flamy
>
> The Hive parsing related stuff is here: https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
> A good starting point to see how to parse queries is here:
> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492
>
>
> Basically, all you need is to use a org.apache.hadoop.hive.ql.parse.ParseDriver.
>
> val pd: ParseDriver = new ParseDriver
> val tree: ASTNode = pd.parse(query, hiveContext)
>
> You then get the ASTNode, that you can freely parse and change.
> Also, I must say that it is quite ugly to manipulate, and the Presto
> Parser seems to be much better designed (but it is not the same syntax,
> unfortunately),
> I recommend to look at it to get better design ideas.
>
>
> If you want to enrich your Hive syntax like I did (I wanted to be able to
> parse ${VARS} in queries),
> you will not be able to use the HiveParser without some workaround.
> What I did was replacing these ${VARS} by strings "${VARS}" that the
> HiveParser would agree to parse,
> and that I could recognize afterwards...
>
> Also, if you are familiar with Scala, I recommend using it, it helps a
> lot...
>
> For instance, I have this class that transforms an AST back into a string
> query:
> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala
> I could never have done something that good looking in Java...
>
> Finally this method helps a lot to understand how the hell the AST works:
> https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593
>
> Make sure to write *tons* of unit tests too, you'll need them.
>
> Hope this helps,
>
> Furcy
>
>
>
> On 16 February 2018 at 21:20, Gopal Vijayaraghavan <go...@apache.org>
> wrote:
>
>>
>> > However, ideally we wish to manipulate the original query as delivered
>> by the user (or as close to it as possible), and we’re finding that the
>> tree has been modified significantly by the time it hits the hook
>>
>> That's CBO. It takes the Query -> AST -> Calcite Tree -> AST -> hook -
>> the bushy join conversion is already done by the time the hook gets called.
>>
>> We need a Parser hook to hook it ahead of CBO, not a Semantic Analyzer
>> hook.
>>
>> > Additionally we wish to track back ASTNodes to the character sequences
>> in the source HQL that were their origin (where sensible), and ultimately
>> hope to be able to regenerate the query text from the AST.
>>
>> I started work on a Hive-unparser a while back based on this class, but
>> it's a world of verbose coding.
>>
>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java#L850
>>
>> If you're doing active work on this, I'd like to help, because I need the
>> AST -> query to debug CBO.
>>
>> > The use case, if you are interested, is a mutation testing framework
>> for HQL. The testing of mutants is operational, but now we need to report
>> on survivors, hence the need to track back from specific query elements to
>> character sequences in the original query string.
>>
>> This sounds a lot like the fuzzing random-query-gen used in Cloudera to
>> have Impala vs Hive bug-for-bug compat.
>>
>> https://cwiki.apache.org/confluence/download/attachments/27362054/Random%20Query%20Gen-%20Hive%20Meetup.pptx
>>
>> Cheers,
>> Gopal
>>
>>
>>
>

Re: HQL parser internals

Posted by Furcy Pin <pi...@gmail.com>.
Hi Elliot,

Actually, I have done quite similar work regarding Hive custom Parsing, you
should have a look at my project: https://github.com/flaminem/flamy

The Hive parsing related stuff is here:
https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
A good starting point to see how to parse queries is here:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492


Basically, all you need is to use a
org.apache.hadoop.hive.ql.parse.ParseDriver.

import org.apache.hadoop.hive.ql.Context
import org.apache.hadoop.hive.ql.parse.{ASTNode, ParseDriver}

val pd: ParseDriver = new ParseDriver
val tree: ASTNode = pd.parse(query, hiveContext)

You then get the ASTNode, which you can freely traverse and change.
Also, I must say that it is quite ugly to manipulate, and the Presto Parser
seems to be much better designed (but it is not the same syntax,
unfortunately);
I recommend looking at it to get better design ideas.


If you want to enrich your Hive syntax like I did (I wanted to be able to
parse ${VARS} in queries),
you will not be able to use the HiveParser without some workaround.
What I did was replacing these ${VARS} by strings "${VARS}" that the
HiveParser would agree to parse,
and that I could recognize afterwards...
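
To make that workaround concrete, here is a minimal sketch of the pre-processing step (my own illustration, not flamy's actual code; the class name, method name and regex are assumptions): each ${...} reference is wrapped in single quotes so the parser accepts it as a plain string literal, which you can then spot again when walking the AST.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VarPlaceholder {

    // Matches Hive variable references such as ${hiveconf:ds} or ${VAR}.
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Wrap each ${...} in single quotes so the parser accepts it as a
    // string literal; the placeholder then survives, recognizable, in the AST.
    static String quoteVariables(String hql) {
        Matcher m = VAR.matcher(hql);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb,
                Matcher.quoteReplacement("'${" + m.group(1) + "}'"));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: SELECT * FROM t WHERE ds = '${hiveconf:ds}'
        System.out.println(quoteVariables(
            "SELECT * FROM t WHERE ds = ${hiveconf:ds}"));
    }
}
```

After manipulating the tree you would apply the reverse substitution (string literal back to ${...}) when regenerating the query text.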

Also, if you are familiar with Scala, I recommend using it, it helps a
lot...

For instance, I have this class that transforms an AST back into a string
query:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala
I could never have done something that good looking in Java...

Finally this method helps a lot to understand how the hell the AST works:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593

Make sure to write *tons* of unit tests too, you'll need them.

Hope this helps,

Furcy



On 16 February 2018 at 21:20, Gopal Vijayaraghavan <go...@apache.org>
wrote:

>
> > However, ideally we wish to manipulate the original query as delivered
> by the user (or as close to it as possible), and we’re finding that the
> tree has been modified significantly by the time it hits the hook
>
> That's CBO. It takes the Query -> AST -> Calcite Tree -> AST -> hook -
> the bushy join conversion is already done by the time the hook gets called.
>
> We need a Parser hook to hook it ahead of CBO, not a Semantic Analyzer
> hook.
>
> > Additionally we wish to track back ASTNodes to the character sequences
> in the source HQL that were their origin (where sensible), and ultimately
> hope to be able to regenerate the query text from the AST.
>
> I started work on a Hive-unparser a while back based on this class, but it's
> a world of verbose coding.
>
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java#L850
>
> If you're doing active work on this, I'd like to help, because I need the
> AST -> query to debug CBO.
>
> > The use case, if you are interested, is a mutation testing framework for
> HQL. The testing of mutants is operational, but now we need to report on
> survivors, hence the need to track back from specific query elements to
> character sequences in the original query string.
>
> This sounds a lot like the fuzzing random-query-gen used in Cloudera to
> have Impala vs Hive bug-for-bug compat.
>
> https://cwiki.apache.org/confluence/download/attachments/27362054/Random%20Query%20Gen-%20Hive%20Meetup.pptx
>
> Cheers,
> Gopal
>
>
>

Re: HQL parser internals

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> However, ideally we wish to manipulate the original query as delivered by the user (or as close to it as possible), and we’re finding that the tree has been modified significantly by the time it hits the hook

That's CBO. It takes the Query -> AST -> Calcite Tree -> AST -> hook - the bushy join conversion is already done by the time the hook gets called.

We need a Parser hook to hook it ahead of CBO, not a Semantic Analyzer hook.

> Additionally we wish to track back ASTNodes to the character sequences in the source HQL that were their origin (where sensible), and ultimately hope to be able to regenerate the query text from the AST.

I started work on a Hive-unparser a while back based on this class, but it's a world of verbose coding.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java#L850

If you're doing active work on this, I'd like to help, because I need the AST -> query to debug CBO.
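
To make the unparsing problem concrete, here is a toy sketch of the depth-first core of an unparser (entirely illustrative: Node is a stand-in of my own, not Hive's ASTNode, which is an ANTLR tree node). The verbosity comes from everything this sketch omits: per-node-type rules for reinserting keywords, parentheses and whitespace.

```java
import java.util.Arrays;
import java.util.List;

// Minimal stand-in for an AST node: a token text plus children.
class Node {
    final String text;
    final List<Node> children;
    Node(String text, Node... children) {
        this.text = text;
        this.children = Arrays.asList(children);
    }
}

public class Unparse {
    // Depth-first concatenation of leaf tokens: the crude core of any
    // unparser. A real one dispatches on node type to emit syntax that
    // the leaves alone do not carry.
    static String unparse(Node n) {
        if (n.children.isEmpty()) {
            return n.text;
        }
        StringBuilder sb = new StringBuilder();
        for (Node c : n.children) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(unparse(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node query = new Node("TOK_QUERY",
            new Node("SELECT"), new Node("x"),
            new Node("FROM"), new Node("t"));
        // Prints: SELECT x FROM t
        System.out.println(unparse(query));
    }
}
```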

> The use case, if you are interested, is a mutation testing framework for HQL. The testing of mutants is operational, but now we need to report on survivors, hence the need to track back from specific query elements to character sequences in the original query string.

This sounds a lot like the fuzzing random-query-gen used in Cloudera to have Impala vs Hive bug-for-bug compat.

https://cwiki.apache.org/confluence/download/attachments/27362054/Random%20Query%20Gen-%20Hive%20Meetup.pptx

Cheers,
Gopal