Posted to notifications@iotdb.apache.org by "Xiangdong Huang (Jira)" <ji...@apache.org> on 2021/09/18 01:27:00 UTC
[jira] [Commented] (IOTDB-1280) Rewrite the Antlr grammar file

    [ https://issues.apache.org/jira/browse/IOTDB-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416983#comment-17416983 ] 

Xiangdong Huang commented on IOTDB-1280:
----------------------------------------

Additionally, consider some complex cases in which a PartialPath contains special characters:

 

{code:sql}

create timeseries  root.select.insert.a.bcd."c s"."a.s"."dfdsf\-@#$%^&(()[]"

{code}

 

`select` and `insert` are keywords. In the current Antlr grammar we declare them in `nodeName`, but I think the following format is better:

 

{code:sql}

create timeseries  root."select"."insert".a.bcd."c s"."a.s"."dfdsf\\-@#$%^&(()[]".1."a\"b"

{code}

 

Mapping the above path onto an InfluxDB schema, it would look like this:

tag_name=tag:

l1="root"

l2="select"

l3="insert"

l4="a"

l5="bcd"

l6="c s"

l7="a.s"

l8="dfdsf\\-@#$%^&(()[]"

l9=1

 

field name="a\"b"
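
To make the mapping above concrete, here is a minimal sketch of how such a path could be split into levels. It is only an illustration, not the IoTDB parser: the function name `split_path` is hypothetical, and it assumes that a backslash inside double quotes escapes the next character and that the quotes themselves are not part of the node name.

```python
from typing import List


def split_path(path: str) -> List[str]:
    """Split a dotted path into node names; a "..." section is one node.

    Assumptions (not taken from the IoTDB grammar): inside quotes a
    backslash escapes the following character, and the surrounding
    quotes are dropped from the resulting node name.
    """
    nodes = []
    buf = []
    in_quotes = False
    i = 0
    while i < len(path):
        ch = path[i]
        if in_quotes:
            if ch == "\\" and i + 1 < len(path):
                # Escaped character: keep it literally, skip the backslash.
                buf.append(path[i + 1])
                i += 2
                continue
            if ch == '"':
                in_quotes = False
            else:
                buf.append(ch)
        elif ch == '"':
            in_quotes = True
        elif ch == ".":
            # An unquoted dot ends the current level.
            nodes.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated quoted node in path: " + path)
    nodes.append("".join(buf))
    return nodes


EXAMPLE = r'root."select"."insert".a.bcd."c s"."a.s"."dfdsf\\-@#$%^&(()[]".1."a\"b"'
nodes = split_path(EXAMPLE)
for level, name in enumerate(nodes, start=1):
    print("l%d=%s" % (level, name))
```

Under these assumptions the quoted "a.s" stays one level, and the last node comes out as a, a double quote, then b.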

 

InfluxDB users usually wrap tag values in double quotes, for example:

insert cpu,hostname="127.0.0.1",rack="rack1"

 

but current IoTDB users have no such habit.

 

Furthermore:

we may choose not to store the double quotes as part of the string in the MTree:

root

|-select

     |-a.s

         |-a

            |-1

 

but when we query the path, we need to let users know that "a.s" is a single node rather than two nodes on two levels,

i.e., root."select"."a.s"."a".1
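
A rough sketch of the display side, assuming node names are stored without quotes and are re-quoted only when shown to the user. The names `quote_node`, `format_path` and the `KEYWORDS` set are hypothetical, and the rule "quote anything that is a keyword or contains characters outside letters, digits and underscore" is my assumption, not the agreed design. Note that under this rule a plain name such as a is left unquoted.

```python
import re

# Hypothetical subset of reserved words; the real list lives in the grammar.
KEYWORDS = {"select", "insert"}

# Names made only of letters, digits and underscores need no quoting
# (an assumption made for this sketch).
_PLAIN = re.compile(r"[A-Za-z0-9_]+")


def quote_node(name):
    """Return the node name as it would be written inside a path."""
    if _PLAIN.fullmatch(name) and name.lower() not in KEYWORDS:
        return name
    # Escape backslashes first, then double quotes, then wrap in quotes.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def format_path(nodes):
    """Join stored node names into a path string that can be parsed again."""
    return ".".join(quote_node(n) for n in nodes)


print(format_path(["root", "select", "a.s", "a", "1"]))
# prints: root."select"."a.s".a.1
```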

 

 

> Rewrite the Antlr grammar file
> ------------------------------
>
>                 Key: IOTDB-1280
>                 URL: https://issues.apache.org/jira/browse/IOTDB-1280
>             Project: Apache IoTDB
>          Issue Type: Task
>          Components: Planner/SQLParser
>            Reporter: Xiangdong Huang
>            Priority: Major
>
> The current Antlr g4 file is not elegant.
> 1. We should realize that the Lexer and the Parser have different uses:
>  - The Lexer is for grouping characters into one "word", i.e., a string that will not be split in users' programs.  E.g., "abcdere324234", "-2.0", "-2e5", "root.sg.a.b" (if we do not split the path any more... which is impossible). The names of all rules that belong to the Lexer should start with a capital letter.
> - The Parser is for generating the AST. According to Antlr's introductory articles, the parser is more time-consuming than the lexer.  The fewer rules the parser has, the faster it is.
> That is, if something can be defined in the Lexer, we should not define it in the parser.
> IMO, the principle is: 
>  - if we need to split a word into several words, then put the rule into the parser. E.g., for "1h"/"1m", since we have to split the word to get its time unit, why not define it as two words directly in Antlr (i.e., do not define a lexer rule like: "DURATION: (INT+ (Y|M O|W|D|H|M|S|M S|U S|N S))+")?
> Moreover, we can test whether a lexer is fast enough for checking a Path. If so, we do not need to define another Pattern in our program to check whether a time series name is legal.
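
To illustrate the "1h"/"1m" point from the issue description: a small sketch of what the program has to do when a duration arrives as one word, namely split it again into a number and a time unit. The function name `split_duration` is hypothetical, and the unit list is only copied from the DURATION rule quoted above; none of this is the actual IoTDB code.

```python
import re

# Units taken from the DURATION rule quoted in the issue description.
UNITS = ("y", "mo", "w", "d", "h", "m", "s", "ms", "us", "ns")

# Try longer units first so that "ms" is not read as "m" followed by "s".
_PAIR = re.compile(
    r"(\d+)(" + "|".join(sorted(UNITS, key=len, reverse=True)) + r")",
    re.IGNORECASE,
)


def split_duration(text):
    """Split a duration word such as "1h30m" into (value, unit) pairs."""
    pairs = []
    pos = 0
    while pos < len(text):
        match = _PAIR.match(text, pos)
        if match is None:
            raise ValueError("not a duration: " + text)
        pairs.append((int(match.group(1)), match.group(2).lower()))
        pos = match.end()
    if not pairs:
        raise ValueError("empty duration")
    return pairs


print(split_duration("1h30m"))
print(split_duration("500ms"))
```

If the grammar already produced the number and the unit as two separate tokens, this second splitting step in the program would not be needed, which is the argument made in the issue.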


