Posted to dev@tinkerpop.apache.org by Joshua Shinavier <jo...@fortytwo.net> on 2021/06/07 17:36:37 UTC

[DISCUSS] Property types

Hi all,

As we move ahead toward schema support in TinkerPop, one of the concerns we
need to keep in mind is future interoperability with GQL
<https://www.gqlstandards.org/>, the ISO standard property graph query
language which is currently under development. GQL will encompass a
formally-defined data model and schema language for property graphs. One of
the most important inputs to GQL, in terms of schema, is the Property Graph
Schema Working Group (PGSWG
<https://3.basecamp.com/4100172/projects/10013370>), which is concerned
with coming up with that formal data model. While PGSWG is not open to the
public, it is easy enough to become a member of LDBC if you would like to
get involved (feel free to ping me if so). You can also request access to this
doc
<https://docs.google.com/document/d/16ozf76GlEq-YPZa0UJVj8gG6GgUeL_x9M5Wj35JiZP0/edit>
if you would like some insight into the discussion around property types,
which is the topic of the subgroup I lead. What is an appropriate type
system for property values, and by extension, graph elements? In this email
thread, I would like to expose some of the issues we have been discussing
to the TinkerPop community, and get your feedback. While TinkerPop's schema
language will not be based directly on GQL (for one thing, the
specification probably will not be ready for at least another year), it is
important to aim for compatibility, as this is what the major graph vendors
will be expected to implement in the future, if all goes well. There is
also an opportunity for TinkerPop to lead the way and demonstrate
applications of such a schema language in advance of a standard, which in
turn will inform the standard.

Without trying to give an exhaustive overview at this time, here are some
things I would like to mention:

   - *Prescriptive vs. descriptive schemas*. This is very important for
   property graphs. Whereas TinkerPop and most other PG solutions are
   relatively schema-less, there is strong motivation for a real,
   vendor-neutral schema language, which is what is driving the PGSWG. At the
   same time, a solution which requires all graph data to strictly conform to
   a schema in all contexts would not be supportive of typical applications,
   so there needs to be a spectrum. We are still discussing ways in which to
   provide the flexibility of schemas which describe parts of a graph and help
   with validation and inference over those parts, but which are tolerant of
   "other stuff" which goes beyond the schema. For TinkerPop, I think it will
   be simplest to start out with a binary world: either your graph is
   schemaless, i.e. you just don't have or just don't care about a schema, or
   it is expected to strictly conform to a predefined schema. Long-term, we
   are likely to move toward more of a true spectrum, and I think it's
   reasonably clear how to make that transition. In the beginning, you either
   have a schema or you don't.


   - *Algebraic types*. After all of the discussions I have had in the
   working group, at my company, and in the graph community, I still find
   algebraic data types to be the most promising basis for a TinkerPop schema
   language, and this is essentially what I am recommending for GQL as well.
   The initial proposal to GQL will include a kind of "everything but the
   kitchen sink" type grammar which has algebraic as well as non-algebraic
   type constructors (allowing implementations to pick and choose based on
   what works for them), but in TinkerPop I think we can be more focused, and
   I would welcome any discussion around this topic. By algebraic data types,
   I mean products (records) and sums (unions), together with primitives and
   named types (labels) along the lines of Algebraic Property Graphs (APG
   <https://arxiv.org/abs/1909.04881>). An algebraic type system of this
   kind has the advantage of being very straightforward to reason about, while
   also being well aligned with enterprise data languages. There will be
   other, more detailed posts from me to this list on the type system I
   propose to use in TinkerPop.


   - *Atomic types*. As part of the basic type system for property graphs,
   there is broad agreement that there should be a collection of predefined
   atomic types (called "primitive types" in the APG paper and in Dragon) like
   integers, floating-point numbers, character strings, and booleans. There is
   also agreement within the property types group that this collection of
   atomic types may be infinite, and that parameterization of atomic types
   should not be part of the standard schema language. For example, in
   addition to a 32-bit integer type, you might also have a 16-bit integer
   type, a 64-bit integer type, and so on... potentially any number of
   "integer" data types related to each other by a parameter. In GQL, these
   will likely just be given names like int32, int16, etc. and it will be up
   to implementations to determine the internal structure of the types, if
   any. Other possible parameters for integer types, for example, are
   signedness and width encoding. In TinkerPop, I suspect that we will take a
   slightly different approach (and one similar to Dragon) in that the grammar
   will in fact provide vocabulary for type parameters, and we will endeavor
   to make the parameter language expressive enough to serve the needs of most
   graph vendors. That makes the parameterization accessible to programming
   logic for optimizing Gremlin traversals, optimizing graph encodings, etc.
   However, all of this is to be discussed in detail on the dev list.


   - *Nominal vs. structural typing*. Another major area of discussion
   within PGSWG concerns the degree to which property graph schemas should be
   provided up front by the developer, vs. inferred from the structure of the
   graph. In my (enterprise) world, this choice is a no-brainer: you need to
   provide a schema up front, so that you know what type to expect for each
   graph element and, by type inference, for every unit of data in the graph.
   This avoids all sorts of unnecessary data integration challenges.
   However, there are compelling use cases for a more bottom-up approach in
   which the schema is discovered as you go. In a nominally typed world, types
   have names, you know which type to impose on any given value in the graph,
   and two values which are structurally the same (e.g. a pair of doubles
   representing a latitude and longitude, versus a pair of doubles
   representing the min and max value of a range) may be unequal by virtue of
   the expected type. In PGSWG, I think we are pretty close to a solution
   which will support variations of both scenarios, but in TinkerPop again I
   would suggest starting with simplicity and strict nominal typing, then
   relaxing our solution in order to accommodate additional use cases.
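
To make the nominal vs. structural distinction concrete, here is a minimal
Python sketch. It is only an illustration, not part of any TinkerPop or GQL
proposal; the names LatLong, Range, Location, structurally_equal, and
nominally_equal are hypothetical. It models the two "pair of doubles" records
from the example above as product types, adds a small sum type, and compares
values both ways.

```python
from dataclasses import dataclass, astuple
from typing import Union


# Product types (records): two nominally distinct types that share the
# same structure, namely a pair of doubles.
@dataclass(frozen=True)
class LatLong:
    latitude: float
    longitude: float


@dataclass(frozen=True)
class Range:
    minimum: float
    maximum: float


# Sum type (union): a value is exactly one of the alternatives.
# Hypothetical example: a location is either coordinates or a place name.
Location = Union[LatLong, str]


def describe(location: Location) -> str:
    """Handle each alternative of the union explicitly."""
    if isinstance(location, LatLong):
        return f"({location.latitude}, {location.longitude})"
    return location


def structurally_equal(a, b) -> bool:
    """Compare field values only, ignoring the name of the type."""
    return astuple(a) == astuple(b)


def nominally_equal(a, b) -> bool:
    """Require the same named type before comparing field values."""
    return type(a) is type(b) and a == b


point = LatLong(37.0, 122.0)
span = Range(37.0, 122.0)
```

With these definitions, structurally_equal(point, span) holds while
nominally_equal(point, span) does not, which is the sense in which two
structurally identical values can be unequal by virtue of the expected type.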

That's probably a long enough email for now. I am happy to delve deeper
into any of the above, and I will continue to raise new topics on this list.

Josh

Re: [DISCUSS] Property types

Posted by Joshua Shinavier <jo...@fortytwo.net>.

Btw, here is NumericPrecision, which the FloatType and IntegerType
definitions below refer to:

- name: NumericPrecision
  description: "Integer or floating-point precision in bits"
  type:
    union:
      - name: arbitrary
        description: "Arbitrary precision"

      - name: bits
        description: "Precision limited to a given number of bits"
        type: integer


On Wed, Jun 9, 2021 at 11:52 AM Joshua Shinavier <jo...@fortytwo.net> wrote:

> Hi Stephen,
>
> Responses inline.
>
> On Wed, Jun 9, 2021 at 4:04 AM Stephen Mallette <sp...@gmail.com>
> wrote:
>
>> Thanks for the update Josh
>>
>> [...]
>> >    reasonably clear how to make that transition. In the beginning, you
>> >    either have a schema or you don't.
>> >
>>
>> Could you clarify who is making that choice? Is it the provider saying
>> whether their graph supports schemas, or did you mean that the user makes
>> that choice somehow and TinkerPop would then enforce the schema?
>>
>
>
> At first, I don't think we need native schema support in graph providers.
> There will definitely be advantages to such support (e.g. better indexing,
> better query planning) where available, but there is a lot you can do with
> a schema at the application level, like validation, object-graph mapping
> (like Frames, but with no code other than the schema), and Gremlin
> traversal optimizations. tl;dr yes, it's the user who determines the
> schema, although every provider will come with a set of constraints
> (explicit or implicit) on what kinds of schemas can be supported. E.g. most
> providers do not support record-valued properties, so a schema with a
> record type for a property would be an illegal schema w.r.t. that provider
> (or at least, you'd need a mapping to turn the schema into one which is
> supported, e.g. by encoding records as strings).
>
>
>> >    - *Atomic types*. As part of the basic type system for property graphs,
>> [...]
>> >    However, all of this is to be discussed in detail on the dev list.
>> >
>>
>> I'm pretty interested in the direction this goes, as numbers have always
>> been troublesome for our various language variants, and that often doesn't
>> make Gremlin look smart to users of languages off the JVM.
>>
>
>
> Below is the schema for Dragon's primitive types. Booleans and binary
> strings have no parameters, while integer and floating-point types do have
> some parameters. The string type happens to have a maximum-length parameter
> (other commonly asked-for features being minimum length, regex, etc.). This
> is not necessarily the schema we will use for TP4, but it might be close.
> Algebraic Property Graphs does not prescribe any particular set of
> primitive types; Dragon's types represent a pragmatic choice which has been
> appropriate for applications in a particular company. The question to be
> answered for TinkerPop is: where should we draw the line between features
> which are built in to the framework and extensions/ornamentation which are
> best left to individual graph providers? The PGSWG approach, at the moment,
> is more like APG in that there are no prescribed type parameters, and we're
> still deciding whether there should be built-in atomic types at all
> (leaning toward "yes").
>
> It might be worthwhile if you can summarize the problems we have had with
> numeric types, here or in a separate thread, and then we can talk about how
> we might be able to address them with schemas and a data model
> specification.
>
> Josh
>
>
> - name: PrimitiveType
>   description: "A primitive data type, such as a string or boolean type"
>   type:
>     union:
>       - name: binary
>         description: "The type of a binary value, consisting of a sequence of bytes"
>         type: BinaryType
>
>       - name: boolean
>         description: "The type of a boolean value, consisting of true or false"
>         type: BooleanType
>
>       - name: float
>         description: "The type of a floating-point value"
>         type: FloatType
>
>       - name: integer
>         description: "The type of an integer value"
>         type: IntegerType
>
>       - name: string
>         description: "The type of a string value"
>         type: StringType
>
> - name: BinaryType
>   description: "The type of a binary value, consisting of a sequence of bytes"
>
> - name: BooleanType
>   description: "The type of a boolean value (either true or false)"
>
> - name: FloatType
>   description: "A floating-point data type with a given bit precision"
>   type:
>     record:
>       - name: precision
>         description: "The floating-point precision of the type, in bits. Common precision values are 32 and 64."
>         type: NumericPrecision
>   default:
>     precision:
>       bits: 32
>
> - name: IntegerType
>   description: "An integer data type with a given bit precision, signedness, and optional width encoding"
>   type:
>     record:
>       - name: precision
>         description: "The integer precision of the type, in bits. Common precision values are 32 and 64."
>         type: NumericPrecision
>
>       - name: signed
>         description: "Whether the type represents signed or unsigned integers"
>         type: boolean
>
>       - name: fixedWidth
>         description: "Whether a fixed-width integer or varint encoding is preferred"
>         type:
>           optional: boolean
>   default:
>     precision:
>       bits: 32
>     signed: true
>
> - name: StringType
>   description: "A string data type with an optional maximum length. The encoding scheme is unspecified."
>   type:
>     record:
>       - name: maximumLength
>         description: >
>           If provided, an upper bound (inclusive) on the length of the string.
>           If not provided, then there is no such constraint.
>         type:
>           optional: integer
>
>
> Josh
>
>
>

Re: [DISCUSS] Property types

Posted by Joshua Shinavier <jo...@fortytwo.net>.

Hi Stephen,

Responses inline.

On Wed, Jun 9, 2021 at 4:04 AM Stephen Mallette <sp...@gmail.com>
wrote:

> Thanks for the update Josh
>
> [...]
> >    reasonably clear how to make that transition. In the beginning, you
> >    either have a schema or you don't.
> >
>
> Could you clarify who is making that choice? Is it the provider saying
> whether their graph supports schemas, or did you mean that the user makes
> that choice somehow and TinkerPop would then enforce the schema?
>

At first, I don't think we need native schema support in graph providers.
There will definitely be advantages to such support (e.g. better indexing,
better query planning) where available, but there is a lot you can do with
a schema at the application level, like validation, object-graph mapping
(like Frames, but with no code other than the schema), and Gremlin
traversal optimizations. tl;dr yes, it's the user who determines the
schema, although every provider will come with a set of constraints
(explicit or implicit) on what kinds of schemas can be supported. E.g. most
providers do not support record-valued properties, so a schema with a
record type for a property would be an illegal schema w.r.t. that provider
(or at least, you'd need a mapping to turn the schema into one which is
supported, e.g. by encoding records as strings).
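
As a rough illustration of that last point, here is a small Python sketch of
checking a user-defined schema against a provider's constraints at the
application level. Everything in it is hypothetical (the Schema and Provider
classes, the "primitive"/"record" kinds, and both helper functions); it is
not an existing TinkerPop API, just one way the idea could look.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Schema:
    # Maps each property name to the kind of its type:
    # "primitive" or "record" in this simplified sketch.
    property_kinds: Dict[str, str] = field(default_factory=dict)


@dataclass
class Provider:
    name: str
    supported_kinds: Set[str]


def unsupported_properties(schema: Schema, provider: Provider) -> List[str]:
    """List the properties whose type kind the provider cannot store."""
    return sorted(
        name
        for name, kind in schema.property_kinds.items()
        if kind not in provider.supported_kinds
    )


def encode_records_as_strings(schema: Schema) -> Schema:
    """Map a schema to one in which record-valued properties become primitives."""
    return Schema(
        {
            name: ("primitive" if kind == "record" else kind)
            for name, kind in schema.property_kinds.items()
        }
    )


schema = Schema({"name": "primitive", "address": "record"})
provider = Provider("example-provider", {"primitive"})
```

Here unsupported_properties(schema, provider) reports "address" as illegal
for this provider, while the schema returned by
encode_records_as_strings(schema) passes the same check.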

> >    - *Atomic types*. As part of the basic type system for property graphs,
> [...]
> >    However, all of this is to be discussed in detail on the dev list.
> >
>
> I'm pretty interested in the direction this goes, as numbers have always
> been troublesome for our various language variants, and that often doesn't
> make Gremlin look smart to users of languages off the JVM.
>

Below is the schema for Dragon's primitive types. Booleans and binary
strings have no parameters, while integer and floating-point types do have
some parameters. The string type happens to have a maximum-length parameter
(other commonly asked-for features being minimum length, regex, etc.). This
is not necessarily the schema we will use for TP4, but it might be close.
Algebraic Property Graphs does not prescribe any particular set of
primitive types; Dragon's types represent a pragmatic choice which has been
appropriate for applications in a particular company. The question to be
answered for TinkerPop is: where should we draw the line between features
which are built in to the framework and extensions/ornamentation which are
best left to individual graph providers? The PGSWG approach, at the moment,
is more like APG in that there are no prescribed type parameters, and we're
still deciding whether there should be built-in atomic types at all
(leaning toward "yes").

It might be worthwhile if you can summarize the problems we have had with
numeric types, here or in a separate thread, and then we can talk about how
we might be able to address them with schemas and a data model
specification.

Josh

- name: PrimitiveType
  description: "A primitive data type, such as a string or boolean type"
  type:
    union:
      - name: binary
        description: "The type of a binary value, consisting of a sequence of bytes"
        type: BinaryType

      - name: boolean
        description: "The type of a boolean value, consisting of true or false"
        type: BooleanType

      - name: float
        description: "The type of a floating-point value"
        type: FloatType

      - name: integer
        description: "The type of an integer value"
        type: IntegerType

      - name: string
        description: "The type of a string value"
        type: StringType

- name: BinaryType
  description: "The type of a binary value, consisting of a sequence of bytes"

- name: BooleanType
  description: "The type of a boolean value (either true or false)"

- name: FloatType
  description: "A floating-point data type with a given bit precision"
  type:
    record:
      - name: precision
        description: "The floating-point precision of the type, in bits. Common precision values are 32 and 64."
        type: NumericPrecision
  default:
    precision:
      bits: 32

- name: IntegerType
  description: "An integer data type with a given bit precision, signedness, and optional width encoding"
  type:
    record:
      - name: precision
        description: "The integer precision of the type, in bits. Common precision values are 32 and 64."
        type: NumericPrecision

      - name: signed
        description: "Whether the type represents signed or unsigned integers"
        type: boolean

      - name: fixedWidth
        description: "Whether a fixed-width integer or varint encoding is preferred"
        type:
          optional: boolean
  default:
    precision:
      bits: 32
    signed: true

- name: StringType
  description: "A string data type with an optional maximum length. The encoding scheme is unspecified."
  type:
    record:
      - name: maximumLength
        description: >
          If provided, an upper bound (inclusive) on the length of the string.
          If not provided, then there is no such constraint.
        type:
          optional: integer
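
To show why exposing the parameters to programming logic can be useful, here
is a minimal Python sketch of the IntegerType record with its defaults (32
bits, signed). It is an illustration only, not Dragon's or TinkerPop's
implementation. The "uint" prefix for unsigned types, the use of None to
stand in for arbitrary precision, and the method names are my assumptions;
only names like int32 and int16 come from the discussion above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class IntegerType:
    # Defaults mirror the schema: 32-bit precision, signed.
    # bits=None stands in for the "arbitrary" precision alternative.
    bits: Optional[int] = 32
    signed: bool = True
    fixed_width: Optional[bool] = None  # optional encoding preference

    def flat_name(self) -> str:
        """Derive a flat, GQL-style name such as int32 from the parameters."""
        prefix = "int" if self.signed else "uint"
        return prefix if self.bits is None else f"{prefix}{self.bits}"

    def value_range(self) -> Optional[Tuple[int, int]]:
        """Inclusive bounds of representable values, or None if unbounded."""
        if self.bits is None:
            return None
        if self.signed:
            return (-(2 ** (self.bits - 1)), 2 ** (self.bits - 1) - 1)
        return (0, 2 ** self.bits - 1)

    def accepts(self, value: int) -> bool:
        """Check whether a value fits this type (application-level validation)."""
        bounds = self.value_range()
        return bounds is None or bounds[0] <= value <= bounds[1]
```

With the parameters available as data, a validator or traversal optimizer can
compute things like value ranges directly, rather than parsing them out of an
opaque type name: IntegerType().flat_name() gives "int32", and
IntegerType(bits=16).accepts(40000) is false while the unsigned 16-bit type
accepts the same value.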

Josh

Re: [DISCUSS] Property types

Posted by Stephen Mallette <sp...@gmail.com>.

Thanks for the update Josh

>    - *Prescriptive vs. descriptive schemas*. This is very important for
>    property graphs. Whereas TinkerPop and most other PG solutions are
>    relatively schema-less, there is strong motivation for a real,
>    vendor-neutral schema language, which is what is driving the PGSWG. At
>    the same time, a solution which requires all graph data to strictly
>    conform to a schema in all contexts would not be supportive of typical
>    applications, so there needs to be a spectrum. We are still discussing
>    ways in which to provide the flexibility of schemas which describe parts
>    of a graph and help with validation and inference over those parts, but
>    which are tolerant of "other stuff" which goes beyond the schema. For
>    TinkerPop, I think it will be simplest to start out with a binary world:
>    either your graph is schemaless, i.e. you just don't have or just don't
>    care about a schema, or it is expected to strictly conform to a
>    predefined schema. Long-term, we are likely to move toward more of a
>    true spectrum, and I think it's reasonably clear how to make that
>    transition. In the beginning, you either have a schema or you don't.
>

Could you clarify who is making that choice? Is it the provider saying
whether their graph supports schemas, or did you mean that the user makes
that choice somehow and TinkerPop would then enforce the schema?

>    - *Atomic types*. As part of the basic type system for property graphs,
>    there is broad agreement that there should be a collection of predefined
>    atomic types (called "primitive types" in the APG paper and in Dragon)
>    like integers, floating-point numbers, character strings, and booleans.
>    There is also agreement within the property types group that this
>    collection of atomic types may be infinite, and that parameterization of
>    atomic types should not be part of the standard schema language. For
>    example, in addition to a 32-bit integer type, you might also have a
>    16-bit integer type, a 64-bit integer type, and so on... potentially any
>    number of "integer" data types related to each other by a parameter. In
>    GQL, these will likely just be given names like int32, int16, etc. and
>    it will be up to implementations to determine the internal structure of
>    the types, if any. Other possible parameters for integer types, for
>    example, are signedness and width encoding. In TinkerPop, I suspect that
>    we will take a slightly different approach (and one similar to Dragon)
>    in that the grammar will in fact provide vocabulary for type parameters,
>    and we will endeavor to make the parameter language expressive enough to
>    serve the needs of most graph vendors. That makes the parameterization
>    accessible to programming logic for optimizing Gremlin traversals,
>    optimizing graph encodings, etc. However, all of this is to be discussed
>    in detail on the dev list.
>

I'm pretty interested in the direction this goes, as numbers have always
been troublesome for our various language variants, and that often doesn't
make Gremlin look smart to users of languages off the JVM.