Posted to notifications@iotdb.apache.org by "Xiangdong Huang (Jira)" <ji...@apache.org> on 2021/09/18 01:27:00 UTC
[jira] [Commented] (IOTDB-1280) Rewrite the Antlr grammar file
[ https://issues.apache.org/jira/browse/IOTDB-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416983#comment-17416983 ]
Xiangdong Huang commented on IOTDB-1280:
----------------------------------------
Moreover, consider some complex cases in which the PartialPath contains special characters:
{code:sql}
create timeseries root.select.insert.a.bcd."c s"."a.s"."dfdsf\-@#$%^&(()[]"
{code}
`select` and `insert` are keywords, and in the current Antlr grammar we declare them in `nodeName`, but I think the following format is better:
{code:sql}
create timeseries root."select"."insert".a.bcd."c s"."a.s"."dfdsf\\-@#$%^&(()[]".1."a\"b"
{code}
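To make the quoting proposal concrete, here is a minimal sketch of how such a path could be split into node names. The rules are assumptions for illustration, not IoTDB's actual grammar: a node wrapped in double quotes may contain any character, a backslash inside quotes escapes the next character, and the wrapping quotes are dropped. The function name `split_path` is hypothetical.

```python
def split_path(path: str) -> list[str]:
    """Split a dotted path into node names under the assumed quoting rules."""
    nodes: list[str] = []
    current: list[str] = []
    in_quotes = False
    i = 0
    while i < len(path):
        ch = path[i]
        if in_quotes:
            if ch == "\\" and i + 1 < len(path):
                current.append(path[i + 1])  # keep the escaped character literally
                i += 2
                continue
            if ch == '"':
                in_quotes = False  # closing quote: dropped from the node name
            else:
                current.append(ch)
        elif ch == '"':
            in_quotes = True  # opening quote: dropped from the node name
        elif ch == ".":
            nodes.append("".join(current))  # an unquoted dot ends the node
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated quoted node")
    nodes.append("".join(current))
    return nodes


print(split_path(r'root."select"."a.s".a.1'))
# ['root', 'select', 'a.s', 'a', '1']
```

With these rules, a dot inside quotes stays part of the node name, so "a.s" comes back as one node.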
Mapping the above path onto an InfluxDB schema, it would look like:
tag_name=tag:
l1="root"
l2="select"
l3="insert"
l4="a"
l5="bcd"
l6="c s"
l7="a.s"
l8="dfdsf\\-@#$%^&(()[]"
l9=1
field name="a\"b"
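The mapping listed above can be sketched as a small function: every node except the last becomes a level tag (l1, l2, ...), and the last node becomes the field name. This is only an illustration of the idea; `to_tag_schema` is a hypothetical helper and the input is assumed to be a list of already-unquoted node names.

```python
def to_tag_schema(nodes: list[str]) -> tuple[dict[str, str], str]:
    """Map node names to level tags l1..lN plus a field name (the last node)."""
    if len(nodes) < 2:
        raise ValueError("need at least one level and a field name")
    *levels, field = nodes
    # Level tags are numbered from 1, following the l1, l2, ... naming above.
    tags = {f"l{i}": name for i, name in enumerate(levels, start=1)}
    return tags, field


tags, field = to_tag_schema(["root", "select", "a.s", "a b"])
print(tags)   # {'l1': 'root', 'l2': 'select', 'l3': 'a.s'}
print(field)  # a b
```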
InfluxDB users usually wrap tag values in double quotes, like:
insert cpu,hostname="127.0.0.1",rack="rack1"
but current IoTDB users have no such habit.
Furthermore:
we may choose not to store the double quotes as part of the node name in the MTree,
root
|-select
  |-a.s
    |-a
      |-1
but when the path is queried, it must be clear to users that "a.s" is a single node rather than two nodes on two levels,
i.e., root."select"."a.s"."a".1
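One way to keep the stored names quote-free while still showing users an unambiguous path is to add the quotes back only when rendering. The sketch below is an assumption-laden illustration: `KEYWORDS` is a hypothetical subset of the grammar's keywords, and the rule that a node made only of letters, digits, and underscores needs no quotes is a guess, not IoTDB's actual behavior.

```python
import re

# Hypothetical keyword subset; the real list would come from the grammar.
KEYWORDS = {"select", "insert", "create", "timeseries"}
# Assumption: a node made only of letters, digits, and underscores needs no quotes.
PLAIN_NODE = re.compile(r"[A-Za-z0-9_]+")


def quote_node(name: str) -> str:
    """Wrap a stored node name in double quotes only when it would be ambiguous."""
    if PLAIN_NODE.fullmatch(name) and name.lower() not in KEYWORDS:
        return name
    # Escape backslashes first, then double quotes, so the result can be re-parsed.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def render_path(nodes: list[str]) -> str:
    """Join stored node names into a displayable path."""
    return ".".join(quote_node(n) for n in nodes)


print(render_path(["root", "select", "a.s", "a", "1"]))
# root."select"."a.s".a.1
```

Under these assumptions the plain node a is left unquoted; quoting every non-root node, as in the example above, would be an equally valid choice.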
> Rewrite the Antlr grammar file
> ------------------------------
>
> Key: IOTDB-1280
> URL: https://issues.apache.org/jira/browse/IOTDB-1280
> Project: Apache IoTDB
> Issue Type: Task
> Components: Planner/SQLParser
> Reporter: Xiangdong Huang
> Priority: Major
>
> The current Antlr g4 file is not elegant.
> 1. We should recognize that the Lexer and the Parser serve different purposes:
> - The Lexer groups characters into one "word", i.e., a string that will not be split further in the user's program. E.g., "abcdere324234", "-2.0", "-2e5", "root.sg.a.b" (if we never split the path any further... which is impossible). All Lexer rule names should start with a capital letter.
> - The Parser generates the AST. According to Antlr's introductory articles, parsing is more time-consuming than lexing: the fewer rules the parser has, the faster it runs.
> That is, if something can be defined in the Lexer, we should not define it in the Parser.
> IMO, the principle is:
> - if we need to split a word into several words, then put the rule into the parser. E.g., for "1h"/"1m", since we have to split the word to get its time unit, why not define it as two words directly in Antlr (i.e., do not define a lexer rule like "DURATION: (INT+ (Y|M O|W|D|H|M|S|M S|U S|N S))+")?
> Moreover, we can test whether the lexer is fast enough to check a Path. If so, we need not define another Pattern in our program to check whether a time series name is legal.
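To illustrate the point about DURATION: when the lexer emits a whole duration such as "1h30m" as a single token, the program has to split it again to recover each number and time unit. The sketch below shows that extra splitting step in Python; it is only an illustration of the cost being discussed, not IoTDB code. The unit list is taken from the DURATION rule quoted above, and `split_duration` is a hypothetical helper.

```python
import re

# Units from the quoted DURATION rule; "mo", "ms", "us", "ns" are listed
# before "m" and "s" so that the two-letter units are tried first.
_UNIT = r"(?:y|mo|w|d|h|ms|us|ns|m|s)"
_PART = re.compile(rf"(\d+)({_UNIT})", re.IGNORECASE)


def split_duration(text: str) -> list[tuple[int, str]]:
    """Split a duration word such as '1h30m' into (value, unit) pairs."""
    parts: list[tuple[int, str]] = []
    pos = 0
    while pos < len(text):
        match = _PART.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration at offset {pos}: {text!r}")
        parts.append((int(match.group(1)), match.group(2).lower()))
        pos = match.end()
    if not parts:
        raise ValueError("empty duration")
    return parts


print(split_duration("1h30m"))
# [(1, 'h'), (30, 'm')]
```

If the grammar instead produced the integer and the unit as separate tokens, this second pass would not be needed.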