Posted to dev@calcite.apache.org by 柳尘 <yu...@gmail.com> on 2023/12/24 03:19:44 UTC

Feature: Support gremlin adapter

Motivation

Hi, community. More and more users are adopting graph databases such as
JanusGraph and HugeGraph, for example to model personal social networks
and to measure each user's activity in the network. In most graph
databases, both graph construction and graph traversal are expressed in
Gremlin syntax, which is unfriendly to users who are not familiar with it.
Calcite, as a query framework, provides an adapter interface for adapting
different database dialects, covering parsing, relational algebra
conversion, and query plan binding.
Our company has already solved the problem of adapting various graph
databases. Here is my repository:
https://github.com/kaori-seasons/calcite-gremlin-sql


Background

Calcite supports extending database languages through adapters, which lets
users keep a grammar they understand while the dialect details are
simplified away. For example, SqlNode can be extended to implement syntax
analysis, and RelNode can be extended to implement logical plan mapping.

TinkerPop is an adaptation framework for various graph databases, and it
is where Gremlin syntax was first introduced; Gremlin has since become a
universal query layer for graph databases. While extending query
statements through Calcite's adapter interface, we will also use
TinkerPop's universal graph database API to provide dialect compatibility
across different graph databases.

A simple example: the SQL query

SELECT "key" FROM inttype

 maps to

g.V().hasLabel("inttype").group().unfold().select(Column.values)
  .order().by(__.unfold().id())
  .project("inttype")
  .by(__.project("key")
    .by(__.unfold().choose(__.has("key"), __.values("key"),
        __.constant("$%#NULL#%$"))))





The design is divided into three layers: the parsing syntax layer, the
relational algebra transformation layer, and the logical plan binding
layer.

Parsing syntax layer: while parsing the query statement, fields and
equality conditions are split out and converted into vertices and edges.

Relational algebra layer: splits the query into a collection of vertices
and edges and converts it into a TableScan at the aggregation/sorting/query
stage that Calcite abstracts, making it convenient to generate query plans
from the conditions and field information. Join scans, single-table
filters, and similar methods can then traverse the graph from any edge or
any starting vertex.

Logical plan binding layer: binds behaviors such as join scans,
single-table filters, and projections to Calcite's planner to generate
query plans.
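
For context, a Calcite adapter typically exposes the backend through a
Schema that maps table names to Table implementations. Below is a minimal
sketch in Java, assuming one table per vertex label; the class name
GremlinSchema is illustrative and not necessarily what the repository uses:

import java.util.HashMap;
import java.util.Map;
import org.apache.calcite.schema.Table;
import org.apache.calcite.schema.impl.AbstractSchema;

/** Sketch: exposes each vertex label (e.g. "company") as a Calcite table. */
public class GremlinSchema extends AbstractSchema {
  private final Map<String, Table> tableMap = new HashMap<>();

  @Override protected Map<String, Table> getTableMap() {
    // One entry per vertex label discovered from the graph.
    return tableMap;
  }
}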

The process of generating a Gremlin logical plan from a select statement:

1. First, every graph database builds its graph starting from a source. We
will use the GraphTraversalSource provided by TinkerPop as the origin and
extract the vertex and edge information from the incoming statement. This
step will be implemented in SqlSchemaGrabber (a discovery sketch follows
this list).
2. Second, for select/where/having/order by/group by, our plan in the
parsing phase is as follows:

  - group by: a vertex has out-degree and in-degree attributes in the
graph; from the data table's perspective, this is equivalent to
aggregating the table data along the two dimensions IN and OUT. This
behavior also corresponds to traversing the graph through the table;
during the traversal, fold/unfold tags are generated, representing the
direction of the traversal.
  - select: scanning the entire table can be regarded as projecting all
columns onto vertex attributes; the value changes of each column map to
the Gremlin operations that add vertices.
  - where: in graph computing semantics, the filter operation can be
regarded as filtering the edges connected to a vertex's out-degree and
in-degree, so it does not involve relational algebra conversion.
  Instead, it is pushed directly down to the logical plan.
  - order by: as mentioned above, fold/unfold tags are generated along
the traversal path to represent the forward/backward direction.
  If a field has no sortable value, we fall back to the
GraphTraversalSource origin and end the sorting operation.
  If there are sortable values, they are unified in SqlTraversalEngine:
in-degree and out-degree are counted separately for aggregation and then
combined with group by according to the label (IN/OUT).
  - having: same meaning as group by, but with a different label (in
addition to the IN/OUT columns, specific vertex fields need to be
specified).
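
As a rough sketch of what the schema discovery in step 1 could look like
with the TinkerPop API (the traversal steps are standard Gremlin; the
class is illustrative and the real SqlSchemaGrabber may differ):

import java.util.List;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class SchemaDiscoverySketch {
  public static void main(String[] args) {
    // In practice the source would point at the target graph database.
    GraphTraversalSource g = TinkerGraph.open().traversal();

    // Each distinct vertex label becomes a table name.
    List<String> labels = g.V().label().dedup().toList();
    for (String label : labels) {
      // The property keys seen under a label become that table's columns.
      List<String> columns =
          g.V().hasLabel(label).properties().key().dedup().toList();
      System.out.println(label + " -> " + columns);
    }
  }
}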

Currently, I have only completed unit tests that translate SQL into
Gremlin execution plans; test cases for group by and where are still to
be added. In addition, I will write sound integration tests against
mainstream graph databases such as Neo4j and JanusGraph, ensuring that
the graph database's API is successfully called after the SQL request is
converted into a Gremlin execution plan.

Finally, community members are welcome to give suggestions.

Re: Feature: Support gremlin adapter

Posted by 柳尘 <yu...@gmail.com>.
I have invited pengzhiwei2018, an expert in graph computing and a Calcite
committer; I hope he can offer better suggestions in a later reply.

Re: Feature: Support gremlin adapter

Posted by 柳尘 <yu...@gmail.com>.
Thank you for your reply. You are welcome to comment and ask questions
about my proposal.

Forward Xu <fo...@gmail.com> wrote on Monday, December 25, 2023 at 09:51:

> hi
>
> This is a great feature, extending Calcite from regular data queries to
> graph queries (computations).
> +1, looking forward to it.
>
> forwardxu
>

Re: Feature: Support gremlin adapter

Posted by 柳尘 <yu...@gmail.com>.
Hello Julian Hyde, I have only just seen your reply.

I understand that the second implementation approach you mention (the
Gremlin front-end) relates to dialect conversion. This proposal aims to
provide operations over vertex tables and edge tables, such as creation,
traversal, association, and aggregation, for users who are not used to
writing Gremlin. With the help of Calcite's syntax analysis and query plan
push-down, the user's SQL is converted into Gremlin API calls and handed
over to the specified graph database for execution. To sum up, I think
this is a Gremlin adapter.

Julian Hyde <jh...@apache.org> wrote on Thursday, January 4, 2024 at 05:16:

> I had a similar question to Mihai.
>
> After your change, would Calcite be able to translate a user's SQL
> query and execute it against an existing implementation of the Gremlin
> API (say neo4j), or would users be able to talk to Calcite in the
> Gremlin API and have Calcite convert that, via relational algebra, to
> SQL (or whatever language the back-end speaks).
>
> The first of these would be called a 'Gremlin adapter', and the second
> would be called a 'Gremlin front-end'.
>
> Julian
>
> On Tue, Dec 26, 2023 at 5:17 PM 柳尘 <yu...@gmail.com> wrote:
> >
> > Hello Mihai Budiu, I'm sorry for the late reply; our time zones are quite
> > different. I am currently trying to integrate the code from the repository
> > into the Calcite source tree, and there are some dependency conflicts that
> > need to be resolved.
> >
> > Regarding your question about the unclear description of my contribution, I
> > am working to improve it. Simply put, I plan to implement an adapter module
> > for the Gremlin syntax (https://calcite.apache.org/docs/adapter.html).
> > Because Gremlin is a syntax common to many graph databases, I spent a lot
> > of space above introducing its background. After that, I will open a pull
> > request and reply to the mailing list with the specific changes.
> >
> > Mihai Budiu <mb...@gmail.com> wrote on Wednesday, December 27, 2023 at 03:02:
> >
> > > I cannot really figure out from this description what you plan to
> > > contribute.
> > > Can you write a one paragraph description, perhaps a list of bullet
> points?
> > > Something like: "parser for Gremlin dialect, converter from Gremlin IR
> to
> > > Calcite IR, etc."?
> > >
> > > Mihai
> > > ________________________________
> > > From: 柳尘 <yu...@gmail.com>
> > > Sent: Monday, December 25, 2023 3:57 PM
> > > To: dev@calcite.apache.org <de...@calcite.apache.org>
> > > Subject: Re: Feature: Support gremlin adapter
> > >
> > > Thank you very much for your reply. Before I open a new PR for this
> > > feature, I clearly need to write some test cases to illustrate it.
> > > Currently there are some test cases in the repository but no related
> > > integration tests; this work is in progress.
> > >
> > > As an example, below is a representation of a vertex set and an edge
> > > set. You can see that an edge is represented by an out-vertex, an
> > > in-vertex, and a label.
> > >
> > > {
> > >   "tables": [
> > >     { "name": "company", "columns": [ {"name": "name", "type": "string"} ] },
> > >     { "name": "country", "columns": [ {"name": "name", "type": "string"} ] },
> > >     { "name": "planet", "columns": [ {"name": "name", "type": "string"} ] },
> > >     { "name": "person", "columns": [ {"name": "name", "type": "string"},
> > >       {"name": "age", "type": "integer"} ] },
> > >     { "name": "spaceship", "columns": [ {"name": "name", "type": "string"},
> > >       {"name": "model", "type": "string"} ] },
> > >     { "name": "satellite", "columns": [ {"name": "name", "type": "string"} ] },
> > >     { "name": "sensor", "columns": [ {"name": "name", "type": "string"},
> > >       {"name": "type", "type": "string"} ] },
> > >     { "name": "sensorReading", "columns": [
> > >       {"name": "tstamp", "type": "long_timestamp", "propertyName": "timestamp"},
> > >       {"name": "dt", "type": "long_date", "propertyName": "date"},
> > >       {"name": "value", "type": "double"} ] },
> > >     { "name": "fliesTo", "columns": [ {"name": "trips", "type": "integer"} ] },
> > >     { "name": "orbits", "columns": [ {"name": "launched", "type": "integer"} ] }
> > >   ],
> > >   "relationships": [
> > >     {"outTable": "company", "inTable": "country", "edgeLabel": "baseIn"},
> > >     {"outTable": "person", "inTable": "company", "edgeLabel": "worksFor"},
> > >     {"outTable": "person", "inTable": "planets", "edgeLabel": "travelledTo"},
> > >     {"outTable": "company", "inTable": "spaceship", "edgeLabel": "owns"},
> > >     {"outTable": "person", "inTable": "spaceship", "edgeLabel": "pilots"},
> > >     {"outTable": "sensor", "inTable": "sensorReading", "edgeLabel": "hasReading", "fkTable": "sensorReading"},
> > >     {"outTable": "person", "inTable": "planet", "edgeLabel": "fliesTo"},
> > >     {"outTable": "satellite", "inTable": "planet", "edgeLabel": "orbits"},
> > >     {"outTable": "person", "inTable": "person", "edgeLabel": "friendsWith"}
> > >   ]
> > > }
> > >
> > > Please allow me to first introduce some basic Gremlin grammar concepts:
> > >
> > > - V(): queries vertices; for example, g.V() queries all vertices and
> > > g.V('v_id') queries a specific vertex
> > > - E(): queries edges; many kinds of steps can be chained after it
> > > - id(): gets the ids of vertices and edges; for example, g.V().id()
> > > queries the ids of all vertices
> > > - label(): gets the labels of vertices and edges; for example,
> > > g.V().label() queries the labels of all vertices
> > >
> > > Traversal operations in a graph database mainly divide into the
> > > following:
> > >
> > > 1. Based on a vertex:
> > >
> > >     1) out(label): visits the vertex's OUT-direction adjacent vertices
> > > along the specified edge labels (zero edge labels means edges of all
> > > types; one or more edge labels means any of the given labels, same
> > > below);
> > >
> > >     2) in(label): visits the vertex's IN-direction adjacent vertices
> > > along the specified edge labels;
> > >
> > >     3) both(label): visits the vertex's adjacent vertices in both
> > > directions along the specified edge labels;
> > >
> > >     4) outE(label): visits the vertex's OUT-direction adjacent edges
> > > along the specified edge labels;
> > >
> > >     5) inE(label): visits the vertex's IN-direction adjacent edges
> > > along the specified edge labels;
> > >
> > >     6) bothE(label): visits the vertex's adjacent edges in both
> > > directions along the specified edge labels;
> > >
> > >
> > > 2. Based on an edge:
> > >
> > >     1) outV(): visits the edge's out-vertex, i.e. the starting vertex
> > > of the edge;
> > >
> > >     2) inV(): visits the edge's in-vertex, i.e. the target vertex of
> > > the edge, the one the arrow points to;
> > >
> > >     3) bothV(): visits both vertices of the edge;
> > >
> > >     4) otherV(): visits the edge's partner vertex, i.e. the vertex at
> > > the other end relative to the base vertex;
> > >
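> > > For instance, with the schema above, a hand-written traversal (an
> > > illustration, not output of the adapter) walks from person vertices
> > > along worksFor edges to the companies' names:
> > >
> > >     g.V().hasLabel("person").out("worksFor").values("name")
> > >
> > > while g.V().hasLabel("company").inE("worksFor").count() counts the
> > > in-degree of company vertices under that edge label.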
> > >
> > > Let us discuss traversal by vertices and by edges separately.
> > >
> > > If the user chooses to traverse based on vertices, queries and
> > > aggregations are performed per edge label. In the example above, if the
> > > user needs to count the in-degree and out-degree (both kinds) of
> > > company, then at the SQL layer select count(1) from company yields the
> > > desired result (GremlinSqlSelectSingle is abstracted by rewriting the
> > > graph traversal operations for SQL operators such as group by/select,
> > > order by, and having/where, to ensure an execution plan is generated).
> > > Counting only out-degrees or only in-degrees requires a table join: for
> > > example, counting the out-degree from the company table to the country
> > > table is expressed in SQL as select count(1) from company left join
> > > country on company.id = country.id. This corresponds to
> > > GremlinSqlSelectMulti in the code. It abstracts an assembly class for a
> > > vertex table (GremlinVertexTable) holding two sets, inEdges and
> > > outEdges, pre-traverses the out-edges and in-edges that match the
> > > label, and caches them in sqlMetadata. When the user queries, the edge
> > > set that meets the conditions is taken from sqlMetadata and the
> > > execution plan is pushed down.
> > >
> > > If the user chooses to traverse based on edges, the aggregation
> > > dimension must be an attribute of the company vertex set, with the
> > > query still conditioned on the edge label; that is, the name field in
> > > the columns collection. You can write select count(1) from company
> > > group by name to count the out-degree and in-degree (both kinds).
> > >
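> > > As a rough sketch of the join mapping above (hand-written; the
> > > traversal the adapter actually emits may differ), counting the
> > > out-degree from company to country along the baseIn edge label could
> > > look like:
> > >
> > >     g.V().hasLabel("company").outE("baseIn").count()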
> > >
> > > This made me realize that I need to write a more complete document.
> > > Community members are welcome to ask questions about the above and to
> > > add to it.
> > >
> > > Best wishes to you
> > >
> > > Benchao Li <li...@apache.org> wrote on Monday, December 25, 2023 at 22:48:
> > >
> > > > This sounds very interesting to me, although I'm not familiar with
> > > > graph databases.
> > > >
> > > > I'm more interested in how to represent graph data structures and
> > > > graph queries in SQL. Is there any relational database/SQL query
> > > > engine that has done this before? Do we need to add some special data
> > > > types/operators for graph processing?

Re: Feature: Support gremlin adapter

Posted by Julian Hyde <jh...@apache.org>.
I had a similar question to Mihai.

After your change, would Calcite be able to translate a user's SQL
query and execute it against an existing implementation of the Gremlin
API (say neo4j), or would users be able to talk to Calcite in the
Gremlin API and have Calcite convert that, via relational algebra, to
SQL (or whatever language the back-end speaks).

The first of these would be called a 'Gremlin adapter', and the second
would be called a 'Gremlin front-end'.

Julian


Re: Feature: Support gremlin adapter

Posted by 柳尘 <yu...@gmail.com>.
Hello Mihai Budiu, I'm sorry for the late reply; our time zones are quite
different. I am currently trying to integrate the code from the repository
into the Calcite source tree, and there are some dependency conflicts that
need to be resolved.

Regarding your question about the unclear description of my contribution, I
am working to improve it. Simply put, I plan to implement an adapter module
for the Gremlin syntax (https://calcite.apache.org/docs/adapter.html).
Because Gremlin is a syntax common to many graph databases, I spent a lot
of space above introducing its background. After that, I will open a pull
request and reply to the mailing list with the specific changes.

Mihai Budiu <mb...@gmail.com> 于2023年12月27日周三 03:02写道:

> I cannot really figure out from this description what you plan to
> contribute.
> Can you write a one paragraph description, perhaps a list of bullet points?
> Something like: "parser for Gremlin dialect, converter from Gremlin IR to
> Calcite IR, etc."?
>
> Mihai
> ________________________________
> From: 柳尘 <yu...@gmail.com>
> Sent: Monday, December 25, 2023 3:57 PM
> To: dev@calcite.apache.org <de...@calcite.apache.org>
> Subject: Re: Feature: Support gremlin adapter
>
> Thank you, thank you very much for your reply. Before I open a new PR for
> this requirement, obviously I need to write some test cases to illustrate.
> Currently there are some test cases in the repository but no related
> integration tests. This work is in progress .
>
> As an example:
> As shown below, there is a representation of a point set and an edge set.
> You can see that an edge is represented by an out-degree node, an in-degree
> node, and a label.
>
> {
>     "tables": [
>
> { "name": "company", "columns": [ \{"name": "name", "type": "string"}
>         ]
>       },
>
> { "name": "country", "columns": [ \{"name": "name", "type": "string"}
>         ]
>       },
>
> { "name": "planet", "columns": [ \{"name": "name", "type": "string"}
>         ]
>       },
>
> { "name": "person", "columns": [ \{"name": "name", "type": "string"}
> ,
>           {"name": "age", "type": "integer"}
>         ]
>       },
>
> { "name": "spaceship", "columns": [ \{"name": "name", "type": "string"}
> ,
>           {"name": "model", "type": "string"}
>         ]
>       },
>
> { "name": "satellite", "columns": [ \{"name": "name", "type": "string"}
>         ]
>       },
>
> { "name": "sensor", "columns": [ \{"name": "name", "type": "string"}
> ,
>           {"name": "type", "type": "string"}
>         ]
>       },
>
> { "name": "sensorReading", "columns": [ \{"name": "tstamp", "type":
> "long_timestamp", "propertyName": "timestamp"}
> ,
>           {"name": "dt", "type": "long_date", "propertyName": "date"},
>           {"name": "value", "type": "double"}
>         ]
>       },
>
> { "name": "fliesTo", "columns":[ \{"name": "trips", "type": "integer"}
>         ]
>       },
>
> { "name": "orbits", "columns":[ \{"name": "launched", "type": "integer"}
>         ]
>       }
>       ],
>     "relationships": [
>       {"outTable": "company", "inTable": "country", "edgeLabel": "baseIn"},
>       {"outTable": "person", "inTable": "company", "edgeLabel":
> "worksFor"},
>       {"outTable": "person", "inTable": "planets", "edgeLabel":
> "travelledTo"},
>       {"outTable": "company", "inTable": "spaceship", "edgeLabel": "owns"},
>       {"outTable": "person", "inTable": "spaceship", "edgeLabel":
> "pilots"},
>       {"outTable": "sensor", "inTable": "sensorReading", "edgeLabel":
> "hasReading", "fkTable": "sensorReading"},
>       {"outTable": "person", "inTable": "planet", "edgeLabel": "fliesTo"},
>       {"outTable": "satellite", "inTable": "planet", "edgeLabel":
> "orbits"},
>       {"outTable": "person", "inTable": "person", "edgeLabel":
> "friendsWith"}
>     ]
> }
>
> Please allow me to introduce some basic grammar concepts of Gremlin first
>
> - V(): Query vertices, for example: g.V(), g.V(‘v_id’), query all points
> and specific points
> - E(): query edge, there are many types of statements that can be continued
> later
> - id(): Get the id of vertices and edges. Example: g.V().id(), query the
> ids of all vertices
> - label(): Get the labels of vertices and edges. Example: g.V().label(),
> you can query the labels of all vertices
>
> In the traversal operation of the graph database, it is mainly divided into
> the following operations:
>
> 1. Based on the "vertex"
>
>     1) out(label): Access the [vertex's OUT direction adjacent points]
> according to the specified Edge Label (zero Edge Label, representing all
> types of edges; it can also be one or more Edge Labels, representing any
> given Edge Label Side, the same below);
>
>      2) in(label): Access [the IN direction adjacent points of the vertex]
> according to the specified Edge Label;
>
>      3) both(label): Access the [bidirectional adjacent points of the
> vertex] according to the specified Edge Label;
>
>      4) outE(label): Access [the OUT direction adjacent edge of the vertex]
> according to the specified Edge Label;
>
>       5) inE(label): Access [the IN direction adjacent edge of the vertex]
> according to the specified Edge Label;
>
>       6). bothE(label): Access [the two-way adjacent edges of the vertex]
> according to the specified Edge Label;
>
>
> 2. Based on an "edge"
>
>     1) outV(): visit the edge's outgoing vertex, i.e. the starting
> vertex of the edge;
>
>     2) inV(): visit the edge's incoming vertex, i.e. the target
> vertex of the edge, the one the arrow points to;
>
>     3) bothV(): visit both vertices of the edge;
>
>     4) otherV(): visit the edge's partner vertex, i.e. the vertex at
> the other end relative to the base vertex;
>
>
> Let us discuss traversal based on vertices and based on edges
> separately.
>
> If the user chooses to traverse based on vertices, the query and
> aggregation are performed per Edge Label. In the example above, if
> the user wants to count the in-degree and out-degree (both kinds) of
> company, the upper layer can simply issue select count(1) from
> company to get the desired result (GremlinSqlSelectSingle handles
> this by rewriting graph traversal operations for SQL operators such
> as group by, select, order by and having/where, so that an execution
> plan can be generated). If only the out-degree, or only the
> in-degree, should be counted, a join is required. For example,
> counting the out-degree from the company table to the country table
> is expressed in SQL as select count(1) from company left join
> country on company.id = country.id. This corresponds to
> GremlinSqlSelectMulti in the code. It abstracts an assembly class
> for a vertex table, GremlinVertexTable, which holds two sets,
> inEdges and outEdges; it pre-traverses the out-edges and in-edges
> matching the label and caches them in sqlMetadata. When the user
> queries, the matching edge set is taken from sqlMetadata and the
> execution plan is pushed down.
>
> If the user chooses to traverse based on edges, then, given that the
> query is keyed on the Edge Label, the aggregation dimension is an
> attribute from the company vertex set, namely the name field in its
> columns collection. For example, select count(1) from company group
> by name counts the in-degree and out-degree (both kinds) per name.
>
>
> This made me realize that I need to write a more complete document.
> Friends in the community are welcome to ask questions about, or add
> to, anything above.
>
> Best wishes to you
>

Re: Feature: Support gremlin adapter

Posted by Mihai Budiu <mb...@gmail.com>.
I cannot really figure out from this description what you plan to contribute.
Can you write a one-paragraph description, or perhaps a list of bullet points?
Something like: "parser for Gremlin dialect, converter from Gremlin IR to Calcite IR, etc."?

Mihai

Re: Feature: Support gremlin adapter

Posted by 柳尘 <yu...@gmail.com>.
Thank you very much for your reply. Before I open a new PR for this
feature, I clearly need to write some test cases to illustrate it.
The repository currently contains some unit tests, but no related
integration tests yet; that work is in progress.

As an example, below is a representation of a vertex set and an edge
set. You can see that each edge is represented by an out-degree table
(outTable), an in-degree table (inTable), and a label (edgeLabel).

{
  "tables": [
    { "name": "company", "columns": [
        {"name": "name", "type": "string"}
      ]
    },
    { "name": "country", "columns": [
        {"name": "name", "type": "string"}
      ]
    },
    { "name": "planet", "columns": [
        {"name": "name", "type": "string"}
      ]
    },
    { "name": "person", "columns": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "integer"}
      ]
    },
    { "name": "spaceship", "columns": [
        {"name": "name", "type": "string"},
        {"name": "model", "type": "string"}
      ]
    },
    { "name": "satellite", "columns": [
        {"name": "name", "type": "string"}
      ]
    },
    { "name": "sensor", "columns": [
        {"name": "name", "type": "string"},
        {"name": "type", "type": "string"}
      ]
    },
    { "name": "sensorReading", "columns": [
        {"name": "tstamp", "type": "long_timestamp", "propertyName": "timestamp"},
        {"name": "dt", "type": "long_date", "propertyName": "date"},
        {"name": "value", "type": "double"}
      ]
    },
    { "name": "fliesTo", "columns": [
        {"name": "trips", "type": "integer"}
      ]
    },
    { "name": "orbits", "columns": [
        {"name": "launched", "type": "integer"}
      ]
    }
  ],
  "relationships": [
    {"outTable": "company", "inTable": "country", "edgeLabel": "baseIn"},
    {"outTable": "person", "inTable": "company", "edgeLabel": "worksFor"},
    {"outTable": "person", "inTable": "planet", "edgeLabel": "travelledTo"},
    {"outTable": "company", "inTable": "spaceship", "edgeLabel": "owns"},
    {"outTable": "person", "inTable": "spaceship", "edgeLabel": "pilots"},
    {"outTable": "sensor", "inTable": "sensorReading", "edgeLabel": "hasReading", "fkTable": "sensorReading"},
    {"outTable": "person", "inTable": "planet", "edgeLabel": "fliesTo"},
    {"outTable": "satellite", "inTable": "planet", "edgeLabel": "orbits"},
    {"outTable": "person", "inTable": "person", "edgeLabel": "friendsWith"}
  ]
}
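
To make the mapping concrete, here is a minimal sketch (my own
illustration, not code from the adapter) that loads a fragment of
this schema into TinkerPop's in-memory TinkerGraph; the vertex and
edge labels come from the mapping above, while the property values
are invented:

    import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
    import org.apache.tinkerpop.gremlin.structure.Vertex;
    import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

    public class SchemaExample {
      public static void main(String[] args) {
        GraphTraversalSource g = TinkerGraph.open().traversal();
        // Two vertex tables from the "tables" section.
        Vertex company = g.addV("company").property("name", "acme").next();
        Vertex country = g.addV("country").property("name", "netherlands").next();
        // One relationship from the "relationships" section: company -baseIn-> country.
        g.addE("baseIn").from(company).to(country).iterate();
        // The out-degree of company along baseIn, as discussed later in this mail.
        System.out.println(g.V().hasLabel("company").out("baseIn").count().next());
      }
    }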

Please allow me to first introduce some basic Gremlin syntax:

- V(): query vertices. For example, g.V() returns all vertices and
g.V('v_id') returns a specific vertex.
- E(): query edges; many kinds of steps can be chained after it.
- id(): get the ids of vertices or edges. For example, g.V().id()
returns the ids of all vertices.
- label(): get the labels of vertices or edges. For example,
g.V().label() returns the labels of all vertices.

Traversal operations in a graph database mainly fall into the
following categories (a combined example follows the two lists
below):

1. Based on a "vertex"

    1) out(label): visit the vertex's OUT-direction adjacent vertices
along edges with the specified Edge Label (zero Edge Labels means all
edge types; one or more Edge Labels means edges with any of the given
labels; the same applies below);

    2) in(label): visit the vertex's IN-direction adjacent vertices
along edges with the specified Edge Label;

    3) both(label): visit the vertex's adjacent vertices in both
directions along edges with the specified Edge Label;

    4) outE(label): visit the vertex's OUT-direction adjacent edges
with the specified Edge Label;

    5) inE(label): visit the vertex's IN-direction adjacent edges
with the specified Edge Label;

    6) bothE(label): visit the vertex's adjacent edges in both
directions with the specified Edge Label;


2. Based on an "edge"

    1) outV(): visit the edge's outgoing vertex, i.e. the starting
vertex of the edge;

    2) inV(): visit the edge's incoming vertex, i.e. the target
vertex of the edge, the one the arrow points to;

    3) bothV(): visit both vertices of the edge;

    4) otherV(): visit the edge's partner vertex, i.e. the vertex at
the other end relative to the base vertex;
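
For instance, a few of these steps applied to the sample schema above
(a sketch, assuming the data has been loaded as in the earlier
TinkerGraph snippet; the comments show the intent of each traversal):

    g.V().hasLabel("person").out("worksFor")       // companies each person works for
    g.V().hasLabel("company").in("worksFor")       // people working for each company
    g.V().hasLabel("person").outE("fliesTo").inV() // planets reached through fliesTo edges
    g.V().hasLabel("person").bothE("friendsWith").otherV() // friends in either direction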


Let us discuss traversal based on vertices and based on edges
separately.

If the user chooses to traverse based on vertices, the query and
aggregation are performed per Edge Label. In the example above, if
the user wants to count the in-degree and out-degree (both kinds) of
company, the upper layer can simply issue select count(1) from
company to get the desired result (GremlinSqlSelectSingle handles
this by rewriting graph traversal operations for SQL operators such
as group by, select, order by and having/where, so that an execution
plan can be generated). If only the out-degree, or only the
in-degree, should be counted, a join is required. For example,
counting the out-degree from the company table to the country table
is expressed in SQL as select count(1) from company left join
country on company.id = country.id. This corresponds to
GremlinSqlSelectMulti in the code. It abstracts an assembly class
for a vertex table, GremlinVertexTable, which holds two sets,
inEdges and outEdges; it pre-traverses the out-edges and in-edges
matching the label and caches them in sqlMetadata. When the user
queries, the matching edge set is taken from sqlMetadata and the
execution plan is pushed down.
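
As a rough sketch of the intended translation (the exact traversal
the adapter emits may differ, and for brevity this ignores the rows
a LEFT JOIN would keep without a match), the two queries above could
map to something like:

    select count(1) from company
      -> g.V().hasLabel("company").count()

    select count(1) from company left join country on company.id = country.id
      -> g.V().hasLabel("company").out("baseIn").hasLabel("country").count()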

If the user chooses to traverse based on edges, then, given that the
query is keyed on the Edge Label, the aggregation dimension is an
attribute from the company vertex set, namely the name field in its
columns collection. For example, select count(1) from company group
by name counts the in-degree and out-degree (both kinds) per name.
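
Again as a sketch rather than the adapter's definitive output (__ is
TinkerPop's anonymous traversal class), the grouped count could
correspond to a traversal of the form:

    select count(1) from company group by name
      -> g.V().hasLabel("company").group().by("name").by(__.count())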


This made me realize that I need to write a more complete document.
Friends in the community are welcome to ask questions about, or add
to, anything above.

Best wishes to you


Re: Feature: Support gremlin adapter

Posted by Benchao Li <li...@apache.org>.
This sounds very interesting to me, although I'm not familiar with
graph databases.

I'm more interested in how to represent graph data structures and
graph queries in SQL. Is there any relational database/SQL query
engine that has done this before? Do we need to add some special
data types/operators for graph processing?


-- 

Best,
Benchao Li

Re: Feature: Support gremlin adapter

Posted by Forward Xu <fo...@gmail.com>.
Hi,

This is a great feature that extends Calcite from regular data
queries to graph queries (computations).
+1, looking forward to it.

forwardxu
