You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tinkerpop.apache.org by Marko Rodriguez <ok...@gmail.com> on 2019/04/29 14:34:41 UTC

The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Hi,

*** This email is primarily for Josh (and Kuppitz). However, if others are interested… ***

So I did a lot of thinking this weekend about structure/ and this morning, I prototyped both graph/ and rdbms/.

This is the way I’m currently thinking of things:

	1. There are 4 base types in structure/.
		- Primitive: string, long, float, int, … (will constrain these at some point).
		- TTuple<K,V>: key/value map.
		- TSequence<V>: an iterable of v objects.
		- TSymbol: like Ruby, I think we need “enum-like” symbols (e.g., #id, #label).
	
	2. Every structure has a “root.”
		- for graph its TGraph implements TSequence<TVertex>
		- for rdbms its a TDatabase implements TTuple<String,TTable>

	3. Roots implement Structure and thus, are what is generated by StructureFactory.mint().
		- defined using withStructure().
		- For graph, its accessible via V().
		- For rdbms, its accessible via db().

	4. There is a list of core instructions for dealing with these base objects.
		- value(K key): gets the TTuple value for the provided key.
		- values(K key): gets an iterator of the value for the provided key.
		- entries(): gets an iterator of T2Tuple objects for the incoming TTuple.
		- hasXXX(A,B): various has()-based filters for looking into a TTuple and a TSequence
		- db()/V()/etc.: jump to the “root” of the withStructure() structure.
		- drop()/add(): behave as one would expect and thus.

————

For RDBMS, we have three interfaces in rdbms/. (machine/machine-core/structure/rdbms)

	1. TDatabase implements TTuple<String,TTable> // the root structure that indexes the tables.
	2. TTable implements TSequence<TRow<?>> // a table is a sequence of rows
	3. TRow<V> implements TTuple<String,V>> // a row has string column names
	
I then created a new project at machine/structure/jdbc). The classes in here implement the above rdbms/ interfaces/

Here is an RDBMS session:

final Machine machine = LocalMachine.open();
final TraversalSource jdbc =
	Gremlin.traversal(machine).
                        withProcessor(PipesProcessor.class).
                        withStructure(JDBCStructure.class, Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
        
System.out.println(jdbc.db().toList());
System.out.println(jdbc.db().entries().toList());
System.out.println(jdbc.db().value("people").toList());
System.out.println(jdbc.db().values("people").toList());
System.out.println(jdbc.db().values("people").value("name").toList());
System.out.println(jdbc.db().values("people").entries().toList());

This yields:

[<database#conn1: url=jdbc:h2:/tmp/test user=>]
[PEOPLE:<table#PEOPLE>]
[<table#people>]
[<row#PEOPLE:1>, <row#PEOPLE:2>]
[marko, josh]
[NAME:marko, AGE:29, NAME:josh, AGE:32]

The bytecode of the last query is:

[db(<database#conn1: url=jdbc:h2:/tmp/test user=>), values(people), entries]

JDBCDatabase implements TDatabase, Structure. 
	*** JDBCDatabase is the root structure and is referenced by db() *** (CRUCIAL POINT)
	
Assume another table called ADDRESSES with two columns: name and city.

jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).value(“city”)
	
The above is equivalent to:

SELECT city FROM people,addresses WHERE people.name=addresses.name

If you want to do an inner join (a product), you do this:

	jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).as(“y”).path(“x”,”y")

The above is equivalent to:

SELECT * FROM addresses INNER JOIN people ON people.name=addresses.name

NOTES:
	1. Instead of select(), we simply jump to the root via db() (or V() for graph).
	2. Instead of project(), we simply use value() or values().
	3. Instead of select() being overloaded with by() join syntax, we use has() and path().
		- like TP3 we will be smart about dropping path() data once its no longer referenced.
	4. We can also do LEFT and RIGHT JOINs (haven’t thought through FULL OUTER JOIN yet).
		- however, we don’t support ‘null' in TP so I don’t know if we want to support these null-producing joins. ?
	
LEFT JOIN:
	* If an address doesn’t exist for the person, emit a “null”-filled path.

jdbc.db().values(“people”).as(“x”).
  db().values(“addresses”).as(“y”).
    choose(has(“name”,eq(path(“x”).by(“name”))),
      identity(),
      path(“y”).by(null).as(“y”)).
  path(“x”,”y")

SELECT * FROM addresses LEFT JOIN people ON people.name=addresses.name

RIGHT JOIN:

jdbc.db().values(“people”).as(“x”).
  db().values(“addresses”).as(“y”).
    choose(has(“name”,eq(path(“x”).by(“name”))),
      identity(),
      path(“x”).by(null).as(“x”)).
  path(“x”,”y")


SUMMARY:

There are no “low level” instructions. Everything is based on the standard instructions that we know and love. Finally, if not apparent, the above bytecode chunks would ultimately get strategized into a single SQL query (breadth-first) instead of one-off queries (depth-first) to improve performance.

Neat?,
Marko.

http://rredux.com <http://rredux.com/>





Re: The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Posted by Joshua Shinavier <jo...@fortytwo.net>.
FWIW, your graphic did go through.

Having taken a closer look at this, I upgrade my +0.5 to a +1. As long as
we also have that "universal model instruction set" and can perform
analysis and optimizations there, before strategizing to the model-specific
instruction sets, we are golden. I didn't really grok from your previous
email that there is a chance to perform model-specific optimizations at a
separate layer of abstraction from the shared TP4 bytecode. Makes sense.

Josh


On Thu, May 2, 2019 at 12:29 PM Marko Rodriguez <ok...@gmail.com>
wrote:

> Hello,
>
> Please see the attached graphic that represents my previous email’s TP4 VM
> flow from language—to—>database.
>
> In case the mail server removes the attachment, I tweeted the pic here:
> https://twitter.com/twarko/status/1124031133056946177
>
> The difference between the different “instruction sets” is only the
> database-specific CRUD operations.
>
> - universal model: db(), add(), ...
> - property graph: pg:V(), pg:addV(), pg:outE(), pg:inV(), …
> - RDBMs: rdbms:R(), rdbms:addR(), …
> - RDF: rdf:T(), rdf:addT(), …
>
> All other instructions such as repeat(), has(), union(), count(), max(),
> group(), etc. are the same across the various instruction sets.
>
> *** Interesting side-note: batch-time processor like Spark and Hadoop are
> both processor and structure providers in one! This realization a few years
> back would have made the Spark/Giraph integration in TP3 much less
> cumbersome.
>
> Take care,
> Marko.
>
> http://rredux.com
>
>
>
>
>
> On May 2, 2019, at 7:40 AM, Marko Rodriguez <ok...@gmail.com> wrote:
>
> Hey Josh (others),
>
> I was thinking of our recent divergence in thought. I thought it would be
> smart for me to summarize where we are and to do my best to describe your
> model so as to better understand your perspective and to help you better
> understand how your model will ultimately execute on the TP4 VM.
>
> ############################
> # WHY A UNIVERSAL MODEL? #
> ###########################
>
> Every database data model can be losslessly embedded in every other
> database data model.
> - e.g. you can embed a property graph structure in a relational structure.
> - e.g. you can embed a document structure in a property graph structure.
> - e.g. you can embed a wide-column structure in a document structure.
> - …
> - e.g. you can embed a property graph structure in a Hadoop sequence file
> or Spark RDD.
>
> Thus, there exists a data model that can describe these database
> structures in a database agnostic manner.
> - not in terms of tables, vertices, JSON, column families, …
>
> While we call this a “universal model” it is NOT more “general”
> (theoretically powerful) than any other database structure.
>
> Reasons for creating a “universal model”.
>
> 1. To have a reduced set of objects for the TP4 VM to consider.
> - edges are just vertices with one incoming and outgoing “edge.”
> - a column family is just a “map” of rows which are just “maps.”
> - tables are just groupings of schema-equivalent rows.
> - …
> 2. To have a limited set of instructions in the TP4 bytecode specification.
> - outE/inE/outV/inV are just following direct “links” between objects.
> - has(), values(), keys(), valueMap(), etc. need not just apply to
> vertices and edges.
> - …
> 3. To have a simple serialization format.
> - we do not want to ship around rows/vertices/edges/documents/columns/etc.
> - we want to make it easy for other languages to integrate with the TP4 VM.
> - we want to make it easy to create TP4 VMs in other languages.
> - ...
> 4. To have a theoretical understanding of the relationship between the
> various data structures.
> - “this is just a that” is useful to limit the complexities of our
> codebase and explain to the public how different database relate.
>
> Without further ado...
>
> ########################
> # THE UNIVERSAL MODEL #
> ########################
>
> *** This is as I understand it. I will let Josh decide whether I captured
> his ideas correctly. ***
> *** All subsequent x().y().z() expressions are BYTECODE, not GREMLIN (just
> using an easier syntax then [op,arg*]*. ***
>
> The objects:
> 1. primitives: floats, doubles, Strings, ints, etc.
> 2. tuples: key’d collections of primitives. (instances)
> 3. relations: groupings of tuples with ?equivalent? schemas. (types)
>
> The instructions:
> 1. relations can be “queried” for matching tuples.
> 2. tuple values can be projected out to yield primitives.
>
> Lets do a “traversal” from marko to the people he knows.
>
> // g.V().has(‘name’,’marko’).outE(‘knows’).inV().values(‘name’)
>
> db(‘person’).has(‘name’,’marko’).as(‘x’).
> db(‘knows’).has(‘#outV’, path(‘x’).by(‘#id’)).as(‘y’).
> db(‘person’).has(‘#id’, path(‘y’).by(‘#inV’)).
>   values(‘name’)
>
>
> While the above is a single stream of processing, I will state what each
> line above has at that point in the stream.
> - [#label:person,name:marko,age:29]
> - [#label:knows,#outV:1,#inV:2,weight:0.5], ...
> - [#label:person,name:vadas,age:27], ...
> - vadas, ...
> Databases strategies can be smart to realize that only the #id or #inV or
> #outV of the previous object is required and thus, limit what is actually
> accessed and flow’d through the processing engine.
> - [#id:1]
> - [#id:0,#inV:2] …
> - [#id:2,name:vadas] …
> - vadas, ...
> *** More on such compiler optimizations (called strategies) later ***
>
> *POSITIVE NOTES:*
>
> 1. All relations are ‘siblings’ accessed via db().
> - There is no concept of nesting data. A very flat structure.
> 2. All subsequent has()/where()/is()/etc.-filter steps after db() define
> the pattern match query.
> - It is completely up to the database to determine how to retrieve
> matching tuples.
> - For example: using indices, pointer chasing, linear scans w/ filter, etc.
> 3. All subsequent map()/flatmap()/etc. steps are projections of data in
> the tuple.
> - The database returns key’d tuples composed of primitives.
> - Primitive data can be accessed and further processed. (projections)
> 4. The bytecode describes a computation that is irrespective of the
> underlying database’s encoding of that structure.
> - Amazon Neptune, MySQL, Cassandra, Spark, Hadoop, Ignite, etc. can be fed
> the same bytecode and will yield the same result.
> - In other words, given the example above. all databases can now process
> property graph traversals.
>
> *NEGATIVE NOTES:*
>
> 1. Every database has to have a concept of grouping similar tuples.
> 2. It implies an a priori definition of types (at least their existence
> and how to map data to them).
> 3. It implies a particular type of data model even though its represented
> using the “universal model."
> - the example above is a “property graph query” because of #outV, #inV,
> etc. schema’d keys.
> - the above example is a “vertex/edge-labeled property graph query”
>  because ‘person’ and ‘knows’ relations.
> - the above example implies that keys are unique to relations. (e.g.
> name=marko — why db(‘person’)?)
> - though db().has(‘name’,’marko’) can be used to search all relations.
> 4. It requires the use of path()-data.
> - though we could come up with an efficient traverser.last() which returns
> the previous object touched.
> - However, for multi-db() relation matches, as().path() will have to be
> used.
> - This can be optimized out by property graph databases as they support
> pointer chasing. (** more on this later **)
>
> We can relax ‘apriori’-typing to enable ’name’=‘marko’ to be in any
> relation group, not just people relations. Also, lets use the concept of
> last() from (4).
>
> // g.V().has(‘name’,’marko’).outE(‘knows’).inV().values(‘name’)
>
> db(‘vertices’).has(‘name’,’marko’).
> db(‘edges’).has(‘#label’,’knows’).has(‘#outV’, last().by(‘#id’)).
> db(‘vertices’).has(‘#label’,’person’).has(‘#id’, last().by(‘#inV’)).
> values(‘name’)
>
>
> We can make typing completely dynamic and thus, relation groups don’t
> exist in the “universal model.” Thus, databases don’t have to even have a
> concept of groups of relations. However, databases can have relation groups
> via “indices" on #type, #type+#label, etc.
>
> // g.V().has(‘name’,’marko’).outE(‘knows’).inV().values(‘name’)
>
> db().has(’#type’,’vertex’).has(‘name’,’marko’).
> db().has(‘#type’,’edge’).has(‘#label’,’knows’).has(‘#outV’,
> last().by(‘#id’)).
> db().has(‘#type’,’vertex’).has(‘#label’,’person’).has(‘#id’,
> last().by(‘#inV’)).values(‘name’)
>
>
> The above really states that we are dealing with an “vertex/edge-labeled
> property graph”. This is not bad, because we already had the problem of the
> existence of #inV/#outE/etc. so this isn’t any more limiting. Next, TP4
> bytecode is starting to look like SPARQL pattern matching. There are tuples
> and we are matching patterns where data in some tuple equals (or general
> predicate) data in another tuple, etc. The “universal model” is just a
> sequence of key’d tuples with variable keys and lengths! (like an n-tuple
> store).
>
> #############################################
> # TP4 VM EXECUTION OF THE UNIVERSAL MODEL #
> #############################################
>
> All integrating database providers must support the “universal model" db()
> instruction. Its easy to implement, but is inefficient because bytecode
> using that instruction require a bunch of back-and-forths of data from DB
> to TP4 VM. Thus, TP4 will provide strategies to map db().filter()*-bytecode
> (i.e. universal model instructions) to instructions that respect their
> native structure.
>
> Every database provider implements the TP4 interfaces that captures their
> native database encoding.
> - For example, RDBMS:
> https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdbms
> - For example, Property Graph:
> https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/graph
> - For example, RDF:
> https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdf
> - For example, Wide-Column…
> - For example, Document…
> - For example, HyperGraph…
> - etc.
> TP4 will have lots of these interface packages (which will also include
> compiler strategies and instructions).
> The db()-filter()* “universal model” bytecode is submitted to the TP4 VM.
> The TP4 VM looks at the integrated databases’ native structure (according
> to the interfaces it implements) and rewrites all db().filter()*-aspects of
> the submitted bytecode to a database-type specific instruction set that:
> 1. respects the semantics of the the underlying database encoding.
> 2. respects the semantics of TP4’s stream processing (i.e. linear/nested
> functions)
> For example, the previous “universal model" bytecode is rewritten for each
> database type as:
>
> Property graphs:
> pg:V().has(‘name’,’marko’).pg:outE(‘knows’).pg:inV().values(‘name’)
>
> RDBMS:
>   rdbms:R(‘person’).has(‘name’,’marko’)).
>     join(rdbms:R(‘knows’)).by(’#id’,eq(‘#outV’)).
>     join(rdbms:R(‘person’)).by(‘#inV’,eq(‘#id’)).values(‘name’)
> RDF:
>
>   rdf:T().has(’p’,’rdf:type’).has(‘o’,’foaf:Person’).as(‘a’).
>   rdf:
> T().has(’s’,path(‘a’).by(’s’)).has(‘p’,’foaf:name’).has(‘o’,’marko^^xsd:string’).
>   rdf:T().has(’s’,path(‘a').by(’s’)).has(‘p’,’#outE’).as(‘b’).
>   rdf:
> T().has(’s’,path(‘b').by(’o’)).has(‘p’,’rdf:type’).has(‘o’,’foaf:knows’).as(‘c’).
>   rdf:T().has(’s’,path(‘c’).by(‘o’)).has(‘p’,’#inV’).as(‘d’).
>   rdf:T().has(’s’,path(‘d’).by(‘o’)).has(‘p,’rdf:name’).values(‘o’)
>
>
> Next, TP4 will have strategies that can be generally applied to each
> database-type to further optimize the bytecode.
>
> Property graphs:
> pg:V().has(‘name’,’marko’).pg:out(‘knows’).values(‘name’)
>
> RDBMS:
> rdbms:sql(“SELECT name FROM person,knows,person WHERE p1.id=knows.inV …”)
> RDF:
> rdf:sparql(“SELECT ?e WHERE { ?x rdf:type foaf:Person. ?x foaf:name
> marko^^xsd …”)
>
>
> Finally, vendors can then apply their custom strategies. For instance, for
> JanusGraph:
>
>
> jg:v-index(’name’,’marko’,grab(‘out-edges')).jg:out(‘knows’,grab(‘in-vertex’,’name-property').values(‘name’)
>
>
> * The “universal model” instruction set must be supported by every
> database type. [*all databases*]
> * The database-type specific instructions (e.g. V(), sparql(), sql(),
> out(), etc.) are only required to be understood by databases that implement
> that type interface. [*database class*]
> * All  vendor-specific instructions (e.g. jg:v-index()) are only required
> to be understood by that particular database. [*database instance*]
>
> NOTES:
> 1. Optimizations such as sql(), sparql(), etc. are only for bytecode
> fragments that can be universally optimized for that particular class of
> databases.
> 2. Results from sql(), sparql(), etc. can be subjected to further TP4
> stream processing via repeat(), union(), choose(), etc. etc.
> - unfortunately my running example wasn’t complex enough to capture this.
> :(
> - the more we can pull out of TP4 bytecode and put into sql(), sparql(),
> etc. the better.
> - however, some query languages don’t have the respective expressivity for
> all types of computations (e.g. looping, branching, etc.).
> - in such situations, processing moves from DB to TP4 to DB to TP4
> accordingly.
> 3. We have an algorithmic way of mapping databases.
> - The RDBMS query shows there is a “property graph” encoded in tables.
> - The RDF query shows that there is a “property graph” encoded in triples.
>
> In summary:
>
> 1. There is a universal model and a universal instruction set.
> 2. Databases integrate with the TP4 VM via “native database
> type”-interfaces.
> 3. Submitted universal bytecode is rewritten to a database-type specific
> bytecode that respects the native semantics of that database-type. [*decoration
> strategies*]
> 4. TP4 can further strategize that bytecode to take advantage of
> optimizations that are universal to that database-type. [*optimization
> strategies*]
> 5. The underlying database can further strategize that bytecode to take
> unique advantage of their custom optimization features. [*provider
> strategies*]
>
> ################################
> # WHY GO TO ALL THIS TROUBLE? #
> ################################
>
> The million dollar question:
> * "Why would you want to encode an X data structure into a database that
> natively supports a Y data structure?”*
>
> Answer:
> 1. Its not just about databases, its about data formats in general.
> - The "universal model" allows database providers easy access to OLAP
> processors that have a different native structure than them.
> E.g. Spark RDDs, Hadoop SequenceFiles, Beam tuples, ...
> 2. In some scenarios, a Y-database is better at processing X-type data
> structure than the currently existing native X-databases.
> - E.g., JanusGraph is a successful graph database product that encodes a
> property graph in a wide-column store.
> - JanusGraph provides graph sharding, distributed read/write from OLAP
> processing, high-concurrency, fault tolerance, global distribution, etc.
> 3. Database providers can efficiently support other data structures that
> are simply "constrained versions" of their native structure.
> - E.g., Amazon Neptune can support RDF even if their native structure is
> Property Graph.
> - According to the “universal model,” RDF is a restriction on property
> graphs.
> - RDF is just no properties and URI-based identifiers.
> 4. “Agnostic” data(bases) such Redis, Ignite, Spark, etc. can easily
> support common data structures and their respective development communities.
> - With TP4, vendors can expand their product offering into communities
> they are only tangentially aware of.
> - E.g. Redis can immediately “jump into” the RDF space without having
> background knowledge of that space.
> - E.g. Ignite can immediately “jump into” the property graph space...
> - E.g. Spark can immediately “jump into” the document space…
> 5. All TP4-enabled processors automatically work over all TP4-enabled
> databases.
> - JanusGraph gets dynamic query routing with Akka.
> - Amazon Neptune gets multi-threaded query execution with rxJava.
> - ComosDB gets cluster-oriented OLAP query execution with Spark.
> - …
> 6. Language designers that have compilers to TP4 bytecode can work with
> all supporting TP4 databases/processors.
> - Neo4j no longer has to convince vendors to implement Cypher.
> - Amazon doesn’t have to choose between Gremlin, SPARQL, Cypher, etc.
> - Their customers can use their favorite language.
> - Obviously, some languages are better at expressing certain computations
> than others (e.g. SQL over graphs is horrible).
> - Some impedance mismatch issues can arise (e.g. RDF requires URIs for
> ids).
> - A plethora of new languages may emerge as designers don’t have to
> convince vendors to support it.
> - Language designers only have to develop a compiler to TP4 bytecode.
> And there you have it — I believe Apache TinkerPop is on the verge of
> offering a powerful new data(base) theory and technology.
>
> *The Database Virtual Machine*
>
> Thanks for reading,
> Marko.
>
> http://rredux.com
>
>
>
>
> On Apr 30, 2019, at 4:47 PM, Marko Rodriguez <ok...@gmail.com> wrote:
>
> Hello,
>
> First, the "root". While we do need context for traversals, I don't think
> there should be a distinct kind of root for each kind of structure. Once
> again, select(), or operations derived from select() will work just fine.
>
>
> So given your example below, “root” would be db in this case.
> db is the reference to the structure as a whole.
> Within db, substructures exist.
> Logically, this makes sense.
> For instance, a relational database’s references don’t leak outside the
> RDBMs into other areas of your computer’s memory.
> And there is always one entry point into every structure — the connection.
> And what does that connection point to:
> vertices, keyspaces, databases, document collections, etc.
> In other words, “roots.” (even the JVM has a “root” — it called the heap).
>
> Want the "person" table? db.select("person"). Want a sequence of vertices
> with the label "person"? db.select("person"). What we are saying in either
> case is "give me the 'person' relation. Don't project any specific fields;
> just give me all the data". A relational DB and a property graph DB will
> have different ways of supplying the relation, but in either case, it can
> hide behind the same interface (TRelation?).
>
>
> In your lexicon, for both RDBMS and graph:
> db.select(‘person’) is saying, select the people table (which is composed
> of a sequence of “person" rows)
> db.select(‘person’) is saying, select the person vertices (which is
> composed of a sequence of “person" vertices)
> …right off the bat you have the syntax-problem of people vs. person.
> Tables are typically named the plural of the rows. That
> doesn’t exist in graph databases as there is just one vertex set (i.e. one
> “table”).
>
> In my lexicon (TP instructions)
> db().values(‘people’) is saying, flatten out the person rows of the people
> table.
> V().has(label,’person’) is saying, flatten out the vertex objects of the
> graph’s vertices and filter out non-person vertices.
>
> Well, that is stupid, why not have the same syntax for both structures?
> Because they are different. There are no “person” relations in the classic
> property graph (Neo4j 1.0). There are only vertex relations with a
> label=person entry.
> In a relational database there are “person” relations and these are
> bundled into disjoint tables (i.e. relation sets — and schema constrained).
>
> The point I’m making is that instead of trying to fit all these data
> structures into a strict type system that ultimately looks like
> a bunch of disjoint relational sets, lets mimic the vendor-specified
> semantics. Lets take these systems at their face value
> and not try and “mathematize” them. If they are inconsistent and ugly,
> fine. If we map them into another system that is mathematical
> and beautiful, great. However, every data structure, from Neo4j’s
> representation for OLTP traversals
>  to that “same" data being OLAP processed as Spark RDDs or Hadoop
> SequenceFiles will all have their ‘oh shits’ (impedance mismatches) and
> that is okay. As this is the reality we are tying to model!
>
> Graph and RDBMs have two different data models (their unique worldview):
>
> *RDBMS*:   Databases->Tables->Rows->Primitives
> *GraphDB*: Vertices->Edges->Vertices->Edges->Vertices-> ...
>
>
> Here is a person->knows->person “traversal” in TP4 bytecode over an RDBMS
> (#key are ’symbols’ (constants)):
>
> db().values(“people”).as(“x”).
> db().values(“knows”).as(“y”).
>   where(“x”,eq(“y”)).by(#id).by(#outV).
> db().values(“people”).as(“z”).
>   where(“y”,eq(“z”)).by(#inV).by(#id)
>
>
> Pretty freakin’ disgusting, eh? Here is a person->knows->person
> “traversal” in TP4 bytecode over a property graph:
>
> V().has(#label,”person”).values(#outE).has(#label,”knows”).values(#inV)
>
>
> So we have two completely different bytecode representations for the same
> computational result. Why?
> Because we have two completely different data models!
>
> One is a set of disjoint typed-relations (i.e. RDBMS).
> One is a set of nested loosely-typed-relations (i.e. property graphs).
>
> Why not make them the same? Because they are not the same and that is
> exactly what I believe we should be capturing.
>
> Just looking at the two computations above you see that a relational
> database is doing “joins” while a graph database is doing “traversals”.
> We have to use path-data to compute a join. We have to use memory! (and we
> do). We don’t have to use path-data to compute a traversal.
> We don’t have to use memory! (and we don’t!). That is the fundamental
> nature of the respective computations that are taking place.
> That is what gives each system their particular style of computing.
>
> *NEXT*: There is nothing that says you can’t map between the two? Lets go
> property graph to RDBMS.
> - we could make a person table, a software table, a knows table, a created
> table.
> - that only works if the property graph is schema-based.
> - we could make a single vertex table with another 3 column properties
> table (vertexId,key,value)
> - we could…
> Which ever encoding you choose, a different bytecode will be required.
> Fortunately, the space of (reasonable) possibilities is constrained.
> Thus, instead of saying:
> “I want to map from property graph to RDBMS”
> I say:
> “I want to map from a recursive, bi-relational structure to a disjoint
> multi-relational structure where linkage is based on #id/#outV/#inV
> equalities.”
> Now you have constrained the space of possible RDBMS encodings! Moreover,
> we now have an algorithmic solution that not only disconnects “vertices,”
> but also rewrites the bytecode according to the new logical steps required
> to execute the computation as we have a new data structure and a new
> way of moving through that data structure. The pointers are completely
> different! However, as long as the mapping is sound, the rewrite should be
> algorithmic.
>
> I’m getting tired. I see your stuff below about indices and I have
> thoughts on that… but I will address those tomorrow.
>
> Thanks for reading,
> Marko.
>
> http://rredux.com
>
>
>
>
>
>
>
>
> But wait, you say, what if the under the hood, you have a TTable in one
> case, and TSequence in the other? They are so different! That's why
> the Dataflow
> Model
> <
> https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43864.pdf
> >
> is so great; to an extent, you can think of the two as interchangeable. I
> think we would get a lot of mileage out of treating them as interchangeable
> within TP4.
>
> So instead of a data model -specific "root", I argue for a universal root
> together with a set of relations and what we might call an "indexes". An
> index is an arrow from a type to a relation which says "give me a
> column/value pair, and I will give you all matching tuples from this
> relation". The result is another relation. Where data sources differentiate
> themselves is by having different relations and indexes.
>
> For example, if the underlying data structure is nothing but a stream of
> Trip tuples, you will have a single relation "Trip", and no indexes. Sorry;
> you just have to wait for tuples to go by, and filter on them. So if you
> say d.select("Trip", "driver") -- where d is a traversal that gets you to a
> User -- the machine knows that it can't use "driver" to look up a specific
> set of trips; it has to use a filter over all future "Trip" tuples. If, on
> the other hand, we have a relational database, we have the option of
> indexing on "driver". In this case, d.select("Trip", "driver") may take you
> to a specific table like "Trip_by_driver" which has "driver" as a primary
> key. The machine recognizes that this index exists, and uses it to answer
> the query more efficiently. The alternative is to do a full scan over any
> table which contains the "Trip" relation. Since TinkerPop3, we have been
> without a vendor-neutral API for indexes, but this is where such an API
> would really start to shine. Consider Neo4j's single property indexes,
> JanusGraph's composite indexes, and even RDF triple indices (spo, ops,
> etc.) as in AllegroGraph in addition to primary keys in relational
> databases.
>
> TTuple -- cool. +1
>
> "Enums" -- I agree that enums are necessary, but we need even more: tagged
> unions <https://en.wikipedia.org/wiki/Tagged_union>. They are part of the
> system of algebraic data types which I described on Friday. An enum is a
> special case of a tagged union in which there is no value, just a type tag.
> May I suggest something like TValue, which contains a value (possibly
> trivial) together with a type tag. This enables ORs and pattern matching.
> For example, suppose "created" edges are allowed to point to either
> "Project" or "Document" vertices. The in-type of "created" is
> union{project:Project, document:Document). Now the in value of a specific
> edge can be TValue("project", [some project vertex]) or TValue("document",
> [some document vertex]) and you have the freedom to switch on the type tag
> if you want to, e.g. the next step in the traversal can give you the "name"
> of the project or the "title" of the document as appropriate.
>
> Multi-properties -- agreed; has() is good enough.
>
> Meta-properties -- again, this is where I think we should have a
> lower-level select() operation. Then has() builds on that operation.
> Whereas select() matches on fields of a relation, has() matches on property
> values and other higher-order things. If you want properties of properties,
> don't use has(); use select()/from(). Most of the time, you will just want
> to use has().
>
> Agreed that every *entity* should have an id(), and also a label() (though
> it should always be possible to infer label() from the context). I would
> suggest TEntity (or TElement), which has id(), label(), and value(), where
> value() provides the raw value (usually a TTuple) of the entity.
>
> Josh
>
>
>
> On Mon, Apr 29, 2019 at 10:35 AM Marko Rodriguez <ok...@gmail.com>
> wrote:
>
> Hello Josh,
>
> A has("age",29), for example, operates at a different level of
>
> abstraction than a
>
> has("city","Santa Fe") if "city" is a column in an "addresses" table.
>
>
> So hasXXX() operators work on TTuples. Thus:
>
> g.V().hasLabel(‘person’).has(‘age’,29)
> g.V().hasLabel(‘address’).has(‘city’,’Santa Fe’)
>
> ..both work as a person-vertex and an address-vertex are TTuples. If these
> were tables, then:
>
> jdbc.db().values(‘people’).has(‘age’,29)
> jdbc.db().values(‘addresses’).has(‘city’,’Santa Fe’)
>
> …also works as both people and addresses are TTables which extend
> TTuple<String,?>.
>
> In summary, its its a TTuple, then hasXXX() is good go.
>
> ////////// IGNORE UNTIL AFTER READING NEXT SECTION //////////
> *** SIDENOTE: A TTable (which is a TSequence) could have Symbol-based
> metadata. Thus TTable.value(#label) -> “people.” If so, then
> jdbc.db().hasLabel(“people”).has(“age”,29)
>
> At least, they
> are different if the data model allows for multi-properties,
> meta-properties, and hyper-edges. A property is something that can either
> be there, attached to an element, or not be there. There may also be more
> than one such property, and it may have other properties attached to it.
>
> A
>
> column of a table, on the other hand, is always there (even if its value
>
> is
>
> allowed to be null), always has a single value, and cannot have further
> properties attached.
>
>
> 1. Multi-properties.
>
> Multi-properties works because if name references a TSequence, then its
> the sequence that you analyze with has(). This is another reason why
> TSequence is important. Its a reference to a “stream” so there isn’t
> another layer of tuple-nesting.
>
> // assume v[1] has name={marko,mrodriguez,markor}
> g.V(1).value(‘name’) => TSequence<String>
> g.V(1).values(‘name’) => marko, mrodriguez, markor
> g.V(1).has(‘name’,’marko’) => v[1]
>
> 2. Meta-properties
>
> // assume v[1] has name=[value:marko,creator:josh,timestamp:12303] // i.e.
> a tuple value
> g.V(1).value(‘name’) => TTuple<?,String> // doh!
> g.V(1).value(‘name’).value(‘value’) => marko
> g.V(1).value(‘name’).value(‘creator’) => josh
>
> So things get screwy. — however, it only gets screwy when you mix your
> “metadata” key/values with your “data” key/values. This is why I think
> TSymbols are important. Imagine the following meta-property tuple for v[1]:
>
> [#value:marko,creator:josh,timestamp:12303]
>
> If you do g.V(1).value(‘name’), we could look to the value indexed by the
> symbol #value, thus => “marko”.
> If you do g.V(1).values(‘name’), you would get back a TSequence with a
> single TTuple being the meta property.
> If you do g.V(1).values(‘name’).value(), we could get the value indexed by
> the symbol #value.
> If you do g.V(1).values(‘name’).value(‘creator’), it will return the
> primitive string “josh”.
>
> I believe that the following symbols should be recommended for use across
> all data structures.
>        #id, #label, #key, #value
> …where id(), label(), key(), value() are tuple.get(Symbol). Other symbols
> for use with propertygraph/ include:
>        #outE, #inV, #inE, #outV, #bothE, #bothV
>
> In order to simplify user queries, you can let has() and values() do
>
> double
>
> duty, but I still feel that there are lower-level operations at play, at
>
> a
>
> logical level even if not at a bytecode level. However, expressing the a
> traversal in terms of its lowest-level relational operations may also be
> useful for query optimization.
>
>
> One thing that I’m doing, that perhaps you haven’t caught onto yet, is
> that I’m not modeling everything in terms of “tables.” Each data structure
> is trying to stay as pure to its conceptual model as possible. Thus, there
> are no “joins” in property graphs as outE() references a TSequence<TEdge>,
> where TEdge is an interface that extends TTuple. You can just walk without
> doing any type of INNER JOIN. Now, if you model a property graph in a
> relational database, you will have to strategize the bytecode accordingly!
> Just a heads up in case you haven’t noticed that.
>
> Thanks for your input,
> Marko.
>
> http://rredux.com <http://rredux.com/>
>
>
>
>
> Josh
>
>
>
> On Mon, Apr 29, 2019 at 7:34 AM Marko Rodriguez <okrammarko@gmail.com
>
> <mailto:okrammarko@gmail.com <ok...@gmail.com>>>
>
> wrote:
>
> Hi,
>
> *** This email is primarily for Josh (and Kuppitz). However, if others
>
> are
>
> interested… ***
>
> So I did a lot of thinking this weekend about structure/ and this
>
> morning,
>
> I prototyped both graph/ and rdbms/.
>
> This is the way I’m currently thinking of things:
>
>       1. There are 4 base types in structure/.
>               - Primitive: string, long, float, int, … (will constrain
> these at some point).
>               - TTuple<K,V>: key/value map.
>               - TSequence<V>: an iterable of v objects.
>               - TSymbol: like Ruby, I think we need “enum-like” symbols
> (e.g., #id, #label).
>
>       2. Every structure has a “root.”
>               - for graph its TGraph implements TSequence<TVertex>
>               - for rdbms its a TDatabase implements
> TTuple<String,TTable>
>
>       3. Roots implement Structure and thus, are what is generated by
> StructureFactory.mint().
>               - defined using withStructure().
>               - For graph, its accessible via V().
>               - For rdbms, its accessible via db().
>
>       4. There is a list of core instructions for dealing with these
> base objects.
>               - value(K key): gets the TTuple value for the provided
>
> key.
>
>               - values(K key): gets an iterator of the value for the
> provided key.
>               - entries(): gets an iterator of T2Tuple objects for the
> incoming TTuple.
>               - hasXXX(A,B): various has()-based filters for looking
> into a TTuple and a TSequence
>               - db()/V()/etc.: jump to the “root” of the
>
> withStructure()
>
> structure.
>               - drop()/add(): behave as one would expect and thus.
>
> ————
>
> For RDBMS, we have three interfaces in rdbms/.
> (machine/machine-core/structure/rdbms)
>
>       1. TDatabase implements TTuple<String,TTable> // the root
> structure that indexes the tables.
>       2. TTable implements TSequence<TRow<?>> // a table is a sequence
> of rows
>       3. TRow<V> implements TTuple<String,V>> // a row has string
>
> column
>
> names
>
> I then created a new project at machine/structure/jdbc). The classes in
> here implement the above rdbms/ interfaces/
>
> Here is an RDBMS session:
>
> final Machine machine = LocalMachine.open();
> final TraversalSource jdbc =
>       Gremlin.traversal(machine).
>                       withProcessor(PipesProcessor.class).
>                       withStructure(JDBCStructure.class,
> Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
>
> System.out.println(jdbc.db().toList());
> System.out.println(jdbc.db().entries().toList());
> System.out.println(jdbc.db().value("people").toList());
> System.out.println(jdbc.db().values("people").toList());
> System.out.println(jdbc.db().values("people").value("name").toList());
> System.out.println(jdbc.db().values("people").entries().toList());
>
> This yields:
>
> [<database#conn1: url=jdbc:h2:/tmp/test user=>]
> [PEOPLE:<table#PEOPLE>]
> [<table#people>]
> [<row#PEOPLE:1>, <row#PEOPLE:2>]
> [marko, josh]
> [NAME:marko, AGE:29, NAME:josh, AGE:32]
>
> The bytecode of the last query is:
>
> [db(<database#conn1: url=jdbc:h2:/tmp/test user=>), values(people),
> entries]
>
> JDBCDatabase implements TDatabase, Structure.
>       *** JDBCDatabase is the root structure and is referenced by db()
> *** (CRUCIAL POINT)
>
> Assume another table called ADDRESSES with two columns: name and city.
>
>
>
>
> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).value(“city”)
>
>
> The above is equivalent to:
>
> SELECT city FROM people,addresses WHERE people.name=addresses.name
>
> If you want to do an inner join (a product), you do this:
>
>
>
>
> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).as(“y”).path(“x”,”y")
>
>
> The above is equivalent to:
>
> SELECT * FROM addresses INNER JOIN people ON people.name=addresses.name
>
> NOTES:
>       1. Instead of select(), we simply jump to the root via db() (or
> V() for graph).
>       2. Instead of project(), we simply use value() or values().
>       3. Instead of select() being overloaded with by() join syntax, we
> use has() and path().
>               - like TP3 we will be smart about dropping path() data
> once its no longer referenced.
>       4. We can also do LEFT and RIGHT JOINs (haven’t thought through
> FULL OUTER JOIN yet).
>               - however, we don’t support ‘null' in TP so I don’t know
> if we want to support these null-producing joins. ?
>
> LEFT JOIN:
>       * If an address doesn’t exist for the person, emit a
>
> “null”-filled
>
> path.
>
> jdbc.db().values(“people”).as(“x”).
> db().values(“addresses”).as(“y”).
>   choose(has(“name”,eq(path(“x”).by(“name”))),
>     identity(),
>     path(“y”).by(null).as(“y”)).
> path(“x”,”y")
>
> SELECT * FROM addresses LEFT JOIN people ON people.name=addresses.name
>
> RIGHT JOIN:
>
> jdbc.db().values(“people”).as(“x”).
> db().values(“addresses”).as(“y”).
>   choose(has(“name”,eq(path(“x”).by(“name”))),
>     identity(),
>     path(“x”).by(null).as(“x”)).
> path(“x”,”y")
>
>
> SUMMARY:
>
> There are no “low level” instructions. Everything is based on the
>
> standard
>
> instructions that we know and love. Finally, if not apparent, the above
> bytecode chunks would ultimately get strategized into a single SQL query
> (breadth-first) instead of one-off queries (depth-first) to improve
> performance.
>
> Neat?,
> Marko.
>
> http://rredux.com <http://rredux.com/> <http://rredux.com/ <
>
> http://rredux.com/>>
>
>
>
>
>

Re: The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Posted by Marko Rodriguez <ok...@gmail.com>.
Hello,

Please see the attached graphic that represents my previous email’s TP4 VM flow from language—to—>database.

In case the mail server removes the attachment, I tweeted the pic here:
	https://twitter.com/twarko/status/1124031133056946177 <https://twitter.com/twarko/status/1124031133056946177>

The difference between the different “instruction sets” is only the database-specific CRUD operations.

	- universal model: db(), add(), ...
	- property graph: pg:V(), pg:addV(), pg:outE(), pg:inV(), …
	- RDBMs: rdbms:R(), rdbms:addR(), …
	- RDF: rdf:T(), rdf:addT(), …

All other instructions such as repeat(), has(), union(), count(), max(), group(), etc. are the same across the various instruction sets.

*** Interesting side-note: batch-time processor like Spark and Hadoop are both processor and structure providers in one! This realization a few years back would have made the Spark/Giraph integration in TP3 much less cumbersome.

Take care,
Marko.

http://rredux.com <http://rredux.com/>





> On May 2, 2019, at 7:40 AM, Marko Rodriguez <ok...@gmail.com> wrote:
> 
> Hey Josh (others),
> 
> I was thinking of our recent divergence in thought. I thought it would be smart for me to summarize where we are and to do my best to describe your model so as to better understand your perspective and to help you better understand how your model will ultimately execute on the TP4 VM.
> 
> ############################
> # WHY A UNIVERSAL MODEL? #
> ###########################
> 
> Every database data model can be losslessly embedded in every other database data model.
> 	- e.g. you can embed a property graph structure in a relational structure.
> 	- e.g. you can embed a document structure in a property graph structure.
> 	- e.g. you can embed a wide-column structure in a document structure.
> 	- …
> 	- e.g. you can embed a property graph structure in a Hadoop sequence file or Spark RDD.
> 
> Thus, there exists a data model that can describe these database structures in a database agnostic manner.
> 	- not in terms of tables, vertices, JSON, column families, …
> 
> While we call this a “universal model” it is NOT more “general” (theoretically powerful) than any other database structure.
> 
> Reasons for creating a “universal model”.
> 
> 	1. To have a reduced set of objects for the TP4 VM to consider.
> 		- edges are just vertices with one incoming and outgoing “edge.”
> 		- a column family is just a “map” of rows which are just “maps.”
> 		- tables are just groupings of schema-equivalent rows.
> 		- …
> 	2. To have a limited set of instructions in the TP4 bytecode specification.
> 		- outE/inE/outV/inV are just following direct “links” between objects.
> 		- has(), values(), keys(), valueMap(), etc. need not just apply to vertices and edges.
> 		- …
> 	3. To have a simple serialization format.
> 		- we do not want to ship around rows/vertices/edges/documents/columns/etc.
> 		- we want to make it easy for other languages to integrate with the TP4 VM.
> 		- we want to make it easy to create TP4 VMs in other languages.
> 		- ...
> 	4. To have a theoretical understanding of the relationship between the various data structures.
> 		- “this is just a that” is useful to limit the complexities of our codebase and explain to the public how different database relate.
> 
> Without further ado...
> 
> ########################
> # THE UNIVERSAL MODEL #
> ########################
> 
> *** This is as I understand it. I will let Josh decide whether I captured his ideas correctly. ***
> *** All subsequent x().y().z() expressions are BYTECODE, not GREMLIN (just using an easier syntax then [op,arg*]*. ***
> 
> The objects:
> 	1. primitives: floats, doubles, Strings, ints, etc.
> 	2. tuples: key’d collections of primitives. (instances)
> 	3. relations: groupings of tuples with ?equivalent? schemas. (types)
> 
> The instructions:
> 	1. relations can be “queried” for matching tuples.
> 	2. tuple values can be projected out to yield primitives.
> 
> Lets do a “traversal” from marko to the people he knows.
> 
> // g.V().has(‘name’,’marko’).outE(‘knows’).inV().values(‘name’)
> 
> db(‘person’).has(‘name’,’marko’).as(‘x’).
> db(‘knows’).has(‘#outV’, path(‘x’).by(‘#id’)).as(‘y’).
> db(‘person’).has(‘#id’, path(‘y’).by(‘#inV’)).
>   values(‘name’)
> 
> While the above is a single stream of processing, I will state what each line above has at that point in the stream.
> 	- [#label:person,name:marko,age:29]
> 	- [#label:knows,#outV:1,#inV:2,weight:0.5], ...
> 	- [#label:person,name:vadas,age:27], ...
> 	- vadas, ...
> Databases strategies can be smart to realize that only the #id or #inV or #outV of the previous object is required and thus, limit what is actually accessed and flow’d through the processing engine.
> 	- [#id:1]
> 	- [#id:0,#inV:2] …
> 	- [#id:2,name:vadas] …
> 	- vadas, ...
> *** More on such compiler optimizations (called strategies) later ***
> 
> POSITIVE NOTES:
> 
> 	1. All relations are ‘siblings’ accessed via db().
> 		- There is no concept of nesting data. A very flat structure.
> 	2. All subsequent has()/where()/is()/etc.-filter steps after db() define the pattern match query.
> 		- It is completely up to the database to determine how to retrieve matching tuples.
> 		- For example: using indices, pointer chasing, linear scans w/ filter, etc.
> 	3. All subsequent map()/flatmap()/etc. steps are projections of data in the tuple.
> 		- The database returns key’d tuples composed of primitives.
> 		- Primitive data can be accessed and further processed. (projections)
> 	4. The bytecode describes a computation that is irrespective of the underlying database’s encoding of that structure.
> 		- Amazon Neptune, MySQL, Cassandra, Spark, Hadoop, Ignite, etc. can be fed the same bytecode and will yield the same result.
> 		- In other words, given the example above. all databases can now process property graph traversals.
> 
> NEGATIVE NOTES:
> 
> 	1. Every database has to have a concept of grouping similar tuples.
> 	2. It implies an a priori definition of types (at least their existence and how to map data to them).
> 	3. It implies a particular type of data model even though its represented using the “universal model."
> 		- the example above is a “property graph query” because of #outV, #inV, etc. schema’d keys.
> 		- the above example is a “vertex/edge-labeled property graph query”  because ‘person’ and ‘knows’ relations.
> 		- the above example implies that keys are unique to relations. (e.g. name=marko — why db(‘person’)?)
> 			- though db().has(‘name’,’marko’) can be used to search all relations.
> 	4. It requires the use of path()-data.
> 		- though we could come up with an efficient traverser.last() which returns the previous object touched.
> 		- However, for multi-db() relation matches, as().path() will have to be used.
> 			- This can be optimized out by property graph databases as they support pointer chasing. (** more on this later **)
> 
> We can relax ‘apriori’-typing to enable ’name’=‘marko’ to be in any relation group, not just people relations. Also, lets use the concept of last() from (4). 
> 
> // g.V().has(‘name’,’marko’).outE(‘knows’).inV().values(‘name’)
> 
> db(‘vertices’).has(‘name’,’marko’).
> db(‘edges’).has(‘#label’,’knows’).has(‘#outV’, last().by(‘#id’)).
> db(‘vertices’).has(‘#label’,’person’).has(‘#id’, last().by(‘#inV’)).values(‘name’)
> 
> We can make typing completely dynamic and thus, relation groups don’t exist in the “universal model.” Thus, databases don’t have to even have a concept of groups of relations. However, databases can have relation groups via “indices" on #type, #type+#label, etc.
> 
> // g.V().has(‘name’,’marko’).outE(‘knows’).inV().values(‘name’)
> 
> db().has(’#type’,’vertex’).has(‘name’,’marko’).
> db().has(‘#type’,’edge’).has(‘#label’,’knows’).has(‘#outV’, last().by(‘#id’)).
> db().has(‘#type’,’vertex’).has(‘#label’,’person’).has(‘#id’, last().by(‘#inV’)).values(‘name’)
> 
> The above really states that we are dealing with an “vertex/edge-labeled property graph”. This is not bad, because we already had the problem of the existence of #inV/#outE/etc. so this isn’t any more limiting. Next, TP4 bytecode is starting to look like SPARQL pattern matching. There are tuples and we are matching patterns where data in some tuple equals (or general predicate) data in another tuple, etc. The “universal model” is just a sequence of key’d tuples with variable keys and lengths! (like an n-tuple store).
> 
> #############################################
> # TP4 VM EXECUTION OF THE UNIVERSAL MODEL #
> #############################################
> 
> All integrating database providers must support the “universal model" db() instruction. Its easy to implement, but is inefficient because bytecode using that instruction require a bunch of back-and-forths of data from DB to TP4 VM. Thus, TP4 will provide strategies to map db().filter()*-bytecode (i.e. universal model instructions) to instructions that respect their native structure.
> 
> Every database provider implements the TP4 interfaces that captures their native database encoding.
> 	- For example, RDBMS: https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdbms <https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdbms>
> 	- For example, Property Graph: https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/graph <https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/graph>
> 	- For example, RDF: https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdf <https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdf>
> 	- For example, Wide-Column…
> 	- For example, Document…
> 	- For example, HyperGraph…
> 	- etc.
> TP4 will have lots of these interface packages (which will also include compiler strategies and instructions).
> 	
> The db()-filter()* “universal model” bytecode is submitted to the TP4 VM. The TP4 VM looks at the integrated databases’ native structure (according to the interfaces it implements) and rewrites all db().filter()*-aspects of the submitted bytecode to a database-type specific instruction set that:
> 	1. respects the semantics of the the underlying database encoding.
> 	2. respects the semantics of TP4’s stream processing (i.e. linear/nested functions)
> For example, the previous “universal model" bytecode is rewritten for each database type as:
> 
> Property graphs:
> 	pg:V().has(‘name’,’marko’).pg:outE(‘knows’).pg:inV().values(‘name’)
> 
> RDBMS:
>   rdbms:R(‘person’).has(‘name’,’marko’)).
>     join(rdbms:R(‘knows’)).by(’#id’,eq(‘#outV’)).
>     join(rdbms:R(‘person’)).by(‘#inV’,eq(‘#id’)).values(‘name’)
> 	
> RDF:
>   rdf:T().has(’p’,’rdf:type’).has(‘o’,’foaf:Person’).as(‘a’).
>   rdf:T().has(’s’,path(‘a’).by(’s’)).has(‘p’,’foaf:name’).has(‘o’,’marko^^xsd:string’).
>   rdf:T().has(’s’,path(‘a').by(’s’)).has(‘p’,’#outE’).as(‘b’).
>   rdf:T().has(’s’,path(‘b').by(’o’)).has(‘p’,’rdf:type’).has(‘o’,’foaf:knows’).as(‘c’).
>   rdf:T().has(’s’,path(‘c’).by(‘o’)).has(‘p’,’#inV’).as(‘d’).
>   rdf:T().has(’s’,path(‘d’).by(‘o’)).has(‘p,’rdf:name’).values(‘o’)
> 
> Next, TP4 will have strategies that can be generally applied to each database-type to further optimize the bytecode.
> 
> Property graphs:
> 	pg:V().has(‘name’,’marko’).pg:out(‘knows’).values(‘name’)
> 
> RDBMS:
> 	rdbms:sql(“SELECT name FROM person,knows,person WHERE p1.id=knows.inV …”)
> 	
> RDF:
> 	rdf:sparql(“SELECT ?e WHERE { ?x rdf:type foaf:Person. ?x foaf:name marko^^xsd …”)
> 
> Finally, vendors can then apply their custom strategies. For instance, for JanusGraph:
> 
> jg:v-index(’name’,’marko’,grab(‘out-edges')).jg:out(‘knows’,grab(‘in-vertex’,’name-property').values(‘name’)
> 
> * The “universal model” instruction set must be supported by every database type. [all databases]
> * The database-type specific instructions (e.g. V(), sparql(), sql(), out(), etc.) are only required to be understood by databases that implement that type interface. [database class]
> * All  vendor-specific instructions (e.g. jg:v-index()) are only required to be understood by that particular database. [database instance]
> 
> NOTES:
> 	1. Optimizations such as sql(), sparql(), etc. are only for bytecode fragments that can be universally optimized for that particular class of databases.
> 	2. Results from sql(), sparql(), etc. can be subjected to further TP4 stream processing via repeat(), union(), choose(), etc. etc.
> 		- unfortunately my running example wasn’t complex enough to capture this. :(
> 		- the more we can pull out of TP4 bytecode and put into sql(), sparql(), etc. the better.
> 		- however, some query languages don’t have the respective expressivity for all types of computations (e.g. looping, branching, etc.).
> 			- in such situations, processing moves from DB to TP4 to DB to TP4 accordingly.
> 	3. We have an algorithmic way of mapping databases.
> 		- The RDBMS query shows there is a “property graph” encoded in tables.
> 		- The RDF query shows that there is a “property graph” encoded in triples.
> 
> In summary:
> 
> 	1. There is a universal model and a universal instruction set.
> 	2. Databases integrate with the TP4 VM via “native database type”-interfaces.
> 	3. Submitted universal bytecode is rewritten to a database-type specific bytecode that respects the native semantics of that database-type. [decoration strategies]
> 	4. TP4 can further strategize that bytecode to take advantage of optimizations that are universal to that database-type. [optimization strategies]
> 	5. The underlying database can further strategize that bytecode to take unique advantage of their custom optimization features. [provider strategies]
> 
> ################################
> # WHY GO TO ALL THIS TROUBLE? #
> ################################
> 
> The million dollar question:
> 	
> 	"Why would you want to encode an X data structure into a database that natively supports a Y data structure?”
> 
> Answer:
> 	1. Its not just about databases, its about data formats in general.
> 		- The "universal model" allows database providers easy access to OLAP processors that have a different native structure than them.
> 			E.g. Spark RDDs, Hadoop SequenceFiles, Beam tuples, ...
> 	2. In some scenarios, a Y-database is better at processing X-type data structure than the currently existing native X-databases.
> 		- E.g., JanusGraph is a successful graph database product that encodes a property graph in a wide-column store.
> 			- JanusGraph provides graph sharding, distributed read/write from OLAP processing, high-concurrency, fault tolerance, global distribution, etc.
> 	3. Database providers can efficiently support other data structures that are simply "constrained versions" of their native structure. 
> 		- E.g., Amazon Neptune can support RDF even if their native structure is Property Graph.
> 			- According to the “universal model,” RDF is a restriction on property graphs.
> 				- RDF is just no properties and URI-based identifiers.
> 	4. “Agnostic” data(bases) such Redis, Ignite, Spark, etc. can easily support common data structures and their respective development communities.
> 		- With TP4, vendors can expand their product offering into communities they are only tangentially aware of.
> 			- E.g. Redis can immediately “jump into” the RDF space without having background knowledge of that space.
> 			- E.g. Ignite can immediately “jump into” the property graph space...
> 			- E.g. Spark can immediately “jump into” the document space…
> 	5. All TP4-enabled processors automatically work over all TP4-enabled databases.
> 		- JanusGraph gets dynamic query routing with Akka.
> 		- Amazon Neptune gets multi-threaded query execution with rxJava.
> 		- ComosDB gets cluster-oriented OLAP query execution with Spark.
> 		- …
> 	6. Language designers that have compilers to TP4 bytecode can work with all supporting TP4 databases/processors.
> 		- Neo4j no longer has to convince vendors to implement Cypher.
> 		- Amazon doesn’t have to choose between Gremlin, SPARQL, Cypher, etc.
> 			- Their customers can use their favorite language.
> 				- Obviously, some languages are better at expressing certain computations than others (e.g. SQL over graphs is horrible).
> 				- Some impedance mismatch issues can arise (e.g. RDF requires URIs for ids).
> 		- A plethora of new languages may emerge as designers don’t have to convince vendors to support it.
> 			- Language designers only have to develop a compiler to TP4 bytecode.
> 		
> And there you have it — I believe Apache TinkerPop is on the verge of offering a powerful new data(base) theory and technology.
> 
> 	The Database Virtual Machine
> 
> Thanks for reading,
> Marko.
> 
> http://rredux.com <http://rredux.com/>
> 
> 
> 
> 
>> On Apr 30, 2019, at 4:47 PM, Marko Rodriguez <okrammarko@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Hello,
>> 
>>> First, the "root". While we do need context for traversals, I don't think
>>> there should be a distinct kind of root for each kind of structure. Once
>>> again, select(), or operations derived from select() will work just fine.
>> 
>> So given your example below, “root” would be db in this case. 
>> db is the reference to the structure as a whole.
>> Within db, substructures exist. 
>> Logically, this makes sense.
>> For instance, a relational database’s references don’t leak outside the RDBMs into other areas of your computer’s memory.
>> And there is always one entry point into every structure — the connection. And what does that connection point to:
>> 	vertices, keyspaces, databases, document collections, etc. 
>> In other words, “roots.” (even the JVM has a “root” — it called the heap).
>> 
>>> Want the "person" table? db.select("person"). Want a sequence of vertices
>>> with the label "person"? db.select("person"). What we are saying in either
>>> case is "give me the 'person' relation. Don't project any specific fields;
>>> just give me all the data". A relational DB and a property graph DB will
>>> have different ways of supplying the relation, but in either case, it can
>>> hide behind the same interface (TRelation?).
>> 
>> In your lexicon, for both RDBMS and graph:
>> 	db.select(‘person’) is saying, select the people table (which is composed of a sequence of “person" rows)
>> 	db.select(‘person’) is saying, select the person vertices (which is composed of a sequence of “person" vertices)
>> …right off the bat you have the syntax-problem of people vs. person. Tables are typically named the plural of the rows. That
>> doesn’t exist in graph databases as there is just one vertex set (i.e. one “table”).
>> 
>> In my lexicon (TP instructions)
>> 	db().values(‘people’) is saying, flatten out the person rows of the people table.
>> 	V().has(label,’person’) is saying, flatten out the vertex objects of the graph’s vertices and filter out non-person vertices.
>> 
>> Well, that is stupid, why not have the same syntax for both structures?
>> Because they are different. There are no “person” relations in the classic property graph (Neo4j 1.0). There are only vertex relations with a label=person entry.
>> In a relational database there are “person” relations and these are bundled into disjoint tables (i.e. relation sets — and schema constrained).
>> 
>> The point I’m making is that instead of trying to fit all these data structures into a strict type system that ultimately looks like
>> a bunch of disjoint relational sets, lets mimic the vendor-specified semantics. Lets take these systems at their face value
>> and not try and “mathematize” them. If they are inconsistent and ugly, fine. If we map them into another system that is mathematical
>> and beautiful, great. However, every data structure, from Neo4j’s representation for OLTP traversals
>>  to that “same" data being OLAP processed as Spark RDDs or Hadoop
>> SequenceFiles will all have their ‘oh shits’ (impedance mismatches) and that is okay. As this is the reality we are tying to model!
>> 
>> Graph and RDBMs have two different data models (their unique worldview):
>> 
>> RDBMS:   Databases->Tables->Rows->Primitives
>> GraphDB: Vertices->Edges->Vertices->Edges->Vertices-> ...
>> 
>> Here is a person->knows->person “traversal” in TP4 bytecode over an RDBMS (#key are ’symbols’ (constants)):
>> 
>> db().values(“people”).as(“x”).
>> db().values(“knows”).as(“y”).
>>   where(“x”,eq(“y”)).by(#id).by(#outV).
>> db().values(“people”).as(“z”).
>>   where(“y”,eq(“z”)).by(#inV).by(#id)
>>    
>> Pretty freakin’ disgusting, eh? Here is a person->knows->person “traversal” in TP4 bytecode over a property graph:
>> 
>> V().has(#label,”person”).values(#outE).has(#label,”knows”).values(#inV)
>> 
>> So we have two completely different bytecode representations for the same computational result. Why?
>> Because we have two completely different data models!
>> 
>> 	One is a set of disjoint typed-relations (i.e. RDBMS).
>> 	One is a set of nested loosely-typed-relations (i.e. property graphs).
>> 
>> Why not make them the same? Because they are not the same and that is exactly what I believe we should be capturing.
>> 
>> Just looking at the two computations above you see that a relational database is doing “joins” while a graph database is doing “traversals”.
>> We have to use path-data to compute a join. We have to use memory! (and we do). We don’t have to use path-data to compute a traversal.
>> We don’t have to use memory! (and we don’t!). That is the fundamental nature of the respective computations that are taking place.
>> That is what gives each system their particular style of computing.
>> 
>> NEXT: There is nothing that says you can’t map between the two? Lets go property graph to RDBMS.
>> 	- we could make a person table, a software table, a knows table, a created table.
>> 		- that only works if the property graph is schema-based.
>> 	- we could make a single vertex table with another 3 column properties table (vertexId,key,value)
>> 	- we could…
>> Which ever encoding you choose, a different bytecode will be required. Fortunately, the space of (reasonable) possibilities is constrained.
>> Thus, instead of saying: 
>> 	“I want to map from property graph to RDBMS” 
>> I say: 
>> 	“I want to map from a recursive, bi-relational structure to a disjoint multi-relational structure where linkage is based on #id/#outV/#inV equalities.”
>> Now you have constrained the space of possible RDBMS encodings! Moreover, we now have an algorithmic solution that not only disconnects “vertices,” 
>> but also rewrites the bytecode according to the new logical steps required to execute the computation as we have a new data structure and a new
>> way of moving through that data structure. The pointers are completely different! However, as long as the mapping is sound, the rewrite should be algorithmic.
>> 
>> I’m getting tired. I see your stuff below about indices and I have thoughts on that… but I will address those tomorrow.
>> 
>> Thanks for reading,
>> Marko.
>> 
>> http://rredux.com <http://rredux.com/>
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> 
>>> But wait, you say, what if the under the hood, you have a TTable in one
>>> case, and TSequence in the other? They are so different! That's why
>>> the Dataflow
>>> Model
>>> <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43864.pdf <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43864.pdf>>
>>> is so great; to an extent, you can think of the two as interchangeable. I
>>> think we would get a lot of mileage out of treating them as interchangeable
>>> within TP4.
>>> 
>>> So instead of a data model -specific "root", I argue for a universal root
>>> together with a set of relations and what we might call an "indexes". An
>>> index is an arrow from a type to a relation which says "give me a
>>> column/value pair, and I will give you all matching tuples from this
>>> relation". The result is another relation. Where data sources differentiate
>>> themselves is by having different relations and indexes.
>>> 
>>> For example, if the underlying data structure is nothing but a stream of
>>> Trip tuples, you will have a single relation "Trip", and no indexes. Sorry;
>>> you just have to wait for tuples to go by, and filter on them. So if you
>>> say d.select("Trip", "driver") -- where d is a traversal that gets you to a
>>> User -- the machine knows that it can't use "driver" to look up a specific
>>> set of trips; it has to use a filter over all future "Trip" tuples. If, on
>>> the other hand, we have a relational database, we have the option of
>>> indexing on "driver". In this case, d.select("Trip", "driver") may take you
>>> to a specific table like "Trip_by_driver" which has "driver" as a primary
>>> key. The machine recognizes that this index exists, and uses it to answer
>>> the query more efficiently. The alternative is to do a full scan over any
>>> table which contains the "Trip" relation. Since TinkerPop3, we have been
>>> without a vendor-neutral API for indexes, but this is where such an API
>>> would really start to shine. Consider Neo4j's single property indexes,
>>> JanusGraph's composite indexes, and even RDF triple indices (spo, ops,
>>> etc.) as in AllegroGraph in addition to primary keys in relational
>>> databases.
>>> 
>>> TTuple -- cool. +1
>>> 
>>> "Enums" -- I agree that enums are necessary, but we need even more: tagged
>>> unions <https://en.wikipedia.org/wiki/Tagged_union <https://en.wikipedia.org/wiki/Tagged_union>>. They are part of the
>>> system of algebraic data types which I described on Friday. An enum is a
>>> special case of a tagged union in which there is no value, just a type tag.
>>> May I suggest something like TValue, which contains a value (possibly
>>> trivial) together with a type tag. This enables ORs and pattern matching.
>>> For example, suppose "created" edges are allowed to point to either
>>> "Project" or "Document" vertices. The in-type of "created" is
>>> union{project:Project, document:Document). Now the in value of a specific
>>> edge can be TValue("project", [some project vertex]) or TValue("document",
>>> [some document vertex]) and you have the freedom to switch on the type tag
>>> if you want to, e.g. the next step in the traversal can give you the "name"
>>> of the project or the "title" of the document as appropriate.
>>> 
>>> Multi-properties -- agreed; has() is good enough.
>>> 
>>> Meta-properties -- again, this is where I think we should have a
>>> lower-level select() operation. Then has() builds on that operation.
>>> Whereas select() matches on fields of a relation, has() matches on property
>>> values and other higher-order things. If you want properties of properties,
>>> don't use has(); use select()/from(). Most of the time, you will just want
>>> to use has().
>>> 
>>> Agreed that every *entity* should have an id(), and also a label() (though
>>> it should always be possible to infer label() from the context). I would
>>> suggest TEntity (or TElement), which has id(), label(), and value(), where
>>> value() provides the raw value (usually a TTuple) of the entity.
>>> 
>>> Josh
>>> 
>>> 
>>> 
>>> On Mon, Apr 29, 2019 at 10:35 AM Marko Rodriguez <okrammarko@gmail.com <ma...@gmail.com>>
>>> wrote:
>>> 
>>>> Hello Josh,
>>>> 
>>>>> A has("age",29), for example, operates at a different level of
>>>> abstraction than a
>>>>> has("city","Santa Fe") if "city" is a column in an "addresses" table.
>>>> 
>>>> So hasXXX() operators work on TTuples. Thus:
>>>> 
>>>> g.V().hasLabel(‘person’).has(‘age’,29)
>>>> g.V().hasLabel(‘address’).has(‘city’,’Santa Fe’)
>>>> 
>>>> ..both work as a person-vertex and an address-vertex are TTuples. If these
>>>> were tables, then:
>>>> 
>>>> jdbc.db().values(‘people’).has(‘age’,29)
>>>> jdbc.db().values(‘addresses’).has(‘city’,’Santa Fe’)
>>>> 
>>>> …also works as both people and addresses are TTables which extend
>>>> TTuple<String,?>.
>>>> 
>>>> In summary, its its a TTuple, then hasXXX() is good go.
>>>> 
>>>> ////////// IGNORE UNTIL AFTER READING NEXT SECTION //////////
>>>> *** SIDENOTE: A TTable (which is a TSequence) could have Symbol-based
>>>> metadata. Thus TTable.value(#label) -> “people.” If so, then
>>>> jdbc.db().hasLabel(“people”).has(“age”,29)
>>>> 
>>>>> At least, they
>>>>> are different if the data model allows for multi-properties,
>>>>> meta-properties, and hyper-edges. A property is something that can either
>>>>> be there, attached to an element, or not be there. There may also be more
>>>>> than one such property, and it may have other properties attached to it.
>>>> A
>>>>> column of a table, on the other hand, is always there (even if its value
>>>> is
>>>>> allowed to be null), always has a single value, and cannot have further
>>>>> properties attached.
>>>> 
>>>> 1. Multi-properties.
>>>> 
>>>> Multi-properties works because if name references a TSequence, then its
>>>> the sequence that you analyze with has(). This is another reason why
>>>> TSequence is important. Its a reference to a “stream” so there isn’t
>>>> another layer of tuple-nesting.
>>>> 
>>>> // assume v[1] has name={marko,mrodriguez,markor}
>>>> g.V(1).value(‘name’) => TSequence<String>
>>>> g.V(1).values(‘name’) => marko, mrodriguez, markor
>>>> g.V(1).has(‘name’,’marko’) => v[1]
>>>> 
>>>> 2. Meta-properties
>>>> 
>>>> // assume v[1] has name=[value:marko,creator:josh,timestamp:12303] // i.e.
>>>> a tuple value
>>>> g.V(1).value(‘name’) => TTuple<?,String> // doh!
>>>> g.V(1).value(‘name’).value(‘value’) => marko
>>>> g.V(1).value(‘name’).value(‘creator’) => josh
>>>> 
>>>> So things get screwy. — however, it only gets screwy when you mix your
>>>> “metadata” key/values with your “data” key/values. This is why I think
>>>> TSymbols are important. Imagine the following meta-property tuple for v[1]:
>>>> 
>>>> [#value:marko,creator:josh,timestamp:12303]
>>>> 
>>>> If you do g.V(1).value(‘name’), we could look to the value indexed by the
>>>> symbol #value, thus => “marko”.
>>>> If you do g.V(1).values(‘name’), you would get back a TSequence with a
>>>> single TTuple being the meta property.
>>>> If you do g.V(1).values(‘name’).value(), we could get the value indexed by
>>>> the symbol #value.
>>>> If you do g.V(1).values(‘name’).value(‘creator’), it will return the
>>>> primitive string “josh”.
>>>> 
>>>> I believe that the following symbols should be recommended for use across
>>>> all data structures.
>>>>        #id, #label, #key, #value
>>>> …where id(), label(), key(), value() are tuple.get(Symbol). Other symbols
>>>> for use with propertygraph/ include:
>>>>        #outE, #inV, #inE, #outV, #bothE, #bothV
>>>> 
>>>>> In order to simplify user queries, you can let has() and values() do
>>>> double
>>>>> duty, but I still feel that there are lower-level operations at play, at
>>>> a
>>>>> logical level even if not at a bytecode level. However, expressing the a
>>>>> traversal in terms of its lowest-level relational operations may also be
>>>>> useful for query optimization.
>>>> 
>>>> One thing that I’m doing, that perhaps you haven’t caught onto yet, is
>>>> that I’m not modeling everything in terms of “tables.” Each data structure
>>>> is trying to stay as pure to its conceptual model as possible. Thus, there
>>>> are no “joins” in property graphs as outE() references a TSequence<TEdge>,
>>>> where TEdge is an interface that extends TTuple. You can just walk without
>>>> doing any type of INNER JOIN. Now, if you model a property graph in a
>>>> relational database, you will have to strategize the bytecode accordingly!
>>>> Just a heads up in case you haven’t noticed that.
>>>> 
>>>> Thanks for your input,
>>>> Marko.
>>>> 
>>>> http://rredux.com <http://rredux.com/> <http://rredux.com/ <http://rredux.com/>>
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> Josh
>>>>> 
>>>>> 
>>>>> 
>>>>> On Mon, Apr 29, 2019 at 7:34 AM Marko Rodriguez <okrammarko@gmail.com <ma...@gmail.com>
>>>> <mailto:okrammarko@gmail.com <ma...@gmail.com>>>
>>>>> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> *** This email is primarily for Josh (and Kuppitz). However, if others
>>>> are
>>>>>> interested… ***
>>>>>> 
>>>>>> So I did a lot of thinking this weekend about structure/ and this
>>>> morning,
>>>>>> I prototyped both graph/ and rdbms/.
>>>>>> 
>>>>>> This is the way I’m currently thinking of things:
>>>>>> 
>>>>>>       1. There are 4 base types in structure/.
>>>>>>               - Primitive: string, long, float, int, … (will constrain
>>>>>> these at some point).
>>>>>>               - TTuple<K,V>: key/value map.
>>>>>>               - TSequence<V>: an iterable of v objects.
>>>>>>               - TSymbol: like Ruby, I think we need “enum-like” symbols
>>>>>> (e.g., #id, #label).
>>>>>> 
>>>>>>       2. Every structure has a “root.”
>>>>>>               - for graph its TGraph implements TSequence<TVertex>
>>>>>>               - for rdbms its a TDatabase implements
>>>>>> TTuple<String,TTable>
>>>>>> 
>>>>>>       3. Roots implement Structure and thus, are what is generated by
>>>>>> StructureFactory.mint().
>>>>>>               - defined using withStructure().
>>>>>>               - For graph, its accessible via V().
>>>>>>               - For rdbms, its accessible via db().
>>>>>> 
>>>>>>       4. There is a list of core instructions for dealing with these
>>>>>> base objects.
>>>>>>               - value(K key): gets the TTuple value for the provided
>>>> key.
>>>>>>               - values(K key): gets an iterator of the value for the
>>>>>> provided key.
>>>>>>               - entries(): gets an iterator of T2Tuple objects for the
>>>>>> incoming TTuple.
>>>>>>               - hasXXX(A,B): various has()-based filters for looking
>>>>>> into a TTuple and a TSequence
>>>>>>               - db()/V()/etc.: jump to the “root” of the
>>>> withStructure()
>>>>>> structure.
>>>>>>               - drop()/add(): behave as one would expect and thus.
>>>>>> 
>>>>>> ————
>>>>>> 
>>>>>> For RDBMS, we have three interfaces in rdbms/.
>>>>>> (machine/machine-core/structure/rdbms)
>>>>>> 
>>>>>>       1. TDatabase implements TTuple<String,TTable> // the root
>>>>>> structure that indexes the tables.
>>>>>>       2. TTable implements TSequence<TRow<?>> // a table is a sequence
>>>>>> of rows
>>>>>>       3. TRow<V> implements TTuple<String,V>> // a row has string
>>>> column
>>>>>> names
>>>>>> 
>>>>>> I then created a new project at machine/structure/jdbc). The classes in
>>>>>> here implement the above rdbms/ interfaces/
>>>>>> 
>>>>>> Here is an RDBMS session:
>>>>>> 
>>>>>> final Machine machine = LocalMachine.open();
>>>>>> final TraversalSource jdbc =
>>>>>>       Gremlin.traversal(machine).
>>>>>>                       withProcessor(PipesProcessor.class).
>>>>>>                       withStructure(JDBCStructure.class,
>>>>>> Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
>>>>>> 
>>>>>> System.out.println(jdbc.db().toList());
>>>>>> System.out.println(jdbc.db().entries().toList());
>>>>>> System.out.println(jdbc.db().value("people").toList());
>>>>>> System.out.println(jdbc.db().values("people").toList());
>>>>>> System.out.println(jdbc.db().values("people").value("name").toList());
>>>>>> System.out.println(jdbc.db().values("people").entries().toList());
>>>>>> 
>>>>>> This yields:
>>>>>> 
>>>>>> [<database#conn1: url=jdbc:h2:/tmp/test user=>]
>>>>>> [PEOPLE:<table#PEOPLE>]
>>>>>> [<table#people>]
>>>>>> [<row#PEOPLE:1>, <row#PEOPLE:2>]
>>>>>> [marko, josh]
>>>>>> [NAME:marko, AGE:29, NAME:josh, AGE:32]
>>>>>> 
>>>>>> The bytecode of the last query is:
>>>>>> 
>>>>>> [db(<database#conn1: url=jdbc:h2:/tmp/test user=>), values(people),
>>>>>> entries]
>>>>>> 
>>>>>> JDBCDatabase implements TDatabase, Structure.
>>>>>>       *** JDBCDatabase is the root structure and is referenced by db()
>>>>>> *** (CRUCIAL POINT)
>>>>>> 
>>>>>> Assume another table called ADDRESSES with two columns: name and city.
>>>>>> 
>>>>>> 
>>>>>> 
>>>> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).value(“city”)
>>>>>> 
>>>>>> The above is equivalent to:
>>>>>> 
>>>>>> SELECT city FROM people,addresses WHERE people.name=addresses.name
>>>>>> 
>>>>>> If you want to do an inner join (a product), you do this:
>>>>>> 
>>>>>> 
>>>>>> 
>>>> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).as(“y”).path(“x”,”y")
>>>>>> 
>>>>>> The above is equivalent to:
>>>>>> 
>>>>>> SELECT * FROM addresses INNER JOIN people ON people.name=addresses.name
>>>>>> 
>>>>>> NOTES:
>>>>>>       1. Instead of select(), we simply jump to the root via db() (or
>>>>>> V() for graph).
>>>>>>       2. Instead of project(), we simply use value() or values().
>>>>>>       3. Instead of select() being overloaded with by() join syntax, we
>>>>>> use has() and path().
>>>>>>               - like TP3 we will be smart about dropping path() data
>>>>>> once its no longer referenced.
>>>>>>       4. We can also do LEFT and RIGHT JOINs (haven’t thought through
>>>>>> FULL OUTER JOIN yet).
>>>>>>               - however, we don’t support ‘null' in TP so I don’t know
>>>>>> if we want to support these null-producing joins. ?
>>>>>> 
>>>>>> LEFT JOIN:
>>>>>>       * If an address doesn’t exist for the person, emit a
>>>> “null”-filled
>>>>>> path.
>>>>>> 
>>>>>> jdbc.db().values(“people”).as(“x”).
>>>>>> db().values(“addresses”).as(“y”).
>>>>>>   choose(has(“name”,eq(path(“x”).by(“name”))),
>>>>>>     identity(),
>>>>>>     path(“y”).by(null).as(“y”)).
>>>>>> path(“x”,”y")
>>>>>> 
>>>>>> SELECT * FROM addresses LEFT JOIN people ON people.name=addresses.name
>>>>>> 
>>>>>> RIGHT JOIN:
>>>>>> 
>>>>>> jdbc.db().values(“people”).as(“x”).
>>>>>> db().values(“addresses”).as(“y”).
>>>>>>   choose(has(“name”,eq(path(“x”).by(“name”))),
>>>>>>     identity(),
>>>>>>     path(“x”).by(null).as(“x”)).
>>>>>> path(“x”,”y")
>>>>>> 
>>>>>> 
>>>>>> SUMMARY:
>>>>>> 
>>>>>> There are no “low level” instructions. Everything is based on the
>>>> standard
>>>>>> instructions that we know and love. Finally, if not apparent, the above
>>>>>> bytecode chunks would ultimately get strategized into a single SQL query
>>>>>> (breadth-first) instead of one-off queries (depth-first) to improve
>>>>>> performance.
>>>>>> 
>>>>>> Neat?,
>>>>>> Marko.
>>>>>> 
>>>>>> http://rredux.com <http://rredux.com/> <http://rredux.com/ <http://rredux.com/>> <http://rredux.com/ <http://rredux.com/> <
>>>> http://rredux.com/ <http://rredux.com/>>>
>> 
> 


Re: The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Posted by Joshua Shinavier <jo...@fortytwo.net>.
OK, beginning at the beginning.


On Mon, May 6, 2019 at 3:58 AM Marko Rodriguez <ok...@gmail.com> wrote:

> Hey Josh,
>
>
> > One more thing is needed: disjoint unions. I described these in my email
> on
> > algebraic property graphs. They are the "plus" operator to complement the
> > "times" operator in our type algebra. A disjoint union type is just like
> a
> > tuple type, but instead of having values for field a AND field b AND
> field
> > c, an instance of a union type has a value for field a XOR field b XOR
> > field c. Let me know if you are not completely sold on union types, and I
> > will provide additional motivation.
>
> Huh. That is an interesting concept. Can you please provide examples?
>

Yes. If you think back to your elementary school algebra, you will recall
four basic associative operations: addition, multiplication, subtraction,
and division. Simple stuff, but let's make things even simpler by throwing
out inverses. So we have: addition and multiplication. You also need unit
elements 0 and 1 which have the usual properties. This structure is called
a semiring <https://en.wikipedia.org/wiki/Semiring>, and with it, you can
build up a rich type system, and allows you to reason on equations of
types. Multiplication represents the concatenation of tuples -- a × b × c
is a type that has a AND b and c -- whereas addition represents a choice -- a +
b + c is a type that has a XOR b XOR c.

Examples of multiplication are edges (e.g. a knows edge type is the product
of Person and Person; the out-vertex is a person, and the in-vertex is a
person) and properties (e.g. age is a product of Person and the primitive
integer type). For example, you could express the knows type as Person
× Person or as prod{out=Person, in=Person} if you want to give names to the
components of tuples (fields).

Examples of addition are in- or out-types which are a disjunction of other
types. For example, in the TInkerPop classic graph, the name property can
attach to either a Person or a Project, so the type is (Person +
Project) × string, or prod{out=sum{person=Person, project=Project},
in=string} if you want field names.

Just as the teacher made you do at the blackboard, you can distribute
multiplication over a sum, so

(Person + Project) × string = (Person × string) + (Project × string)


 In other words, a name property which can attach either to a person or
project is equivalent to two distinct properties, maybe call them personName
and projectName, which each attach to only one type of vertex.

Other fun things you can build with unions include lists, trees, and other
recursive data structures. How do you formalize a "list of people" as a
type? Well, you can think of it in this way:

ListOfPeople = () + (Person) + (Person × Person) + (Person × Person ×
Person) + ...


In other words, a list of people can be either the additive unit (0-tuple),
a single person, a pair of people, a triplet of people... an n-tuple of
people for any n >= 0. You could also write:

ListOfPeople = () + (Person × ListOfPeople)


Products let you concatenate types and tuples to build larger types and
tuples; sums enable choices and pattern matching.



> One thing I want to stress. The “universal bytecode” is just standard
> [op,arg*]* bytecode save that data access is via the “universal model's"
> db() instruction. Thus, AND/OR/pattern matching/etc. is all available.
> Likewise union(), repeat(), coalesce(), choose(), etc. are all available.
>
> db().and(as('a').values('knows').as('b'),
>          or(as('a').has('name','marko'),
>             as('a').values(‘created').count().is(gt(1))),
>          as('b').values(’created').as('c')).
>      path(‘c')
>

No disagreement. This is essentially functional pattern matching as
motivated above, though it includes a condition we wouldn't include in the
type system itself: the "created" count.



> As you can see, and()/or() pattern matching is possible and can be nested.
>   *** SIDENOTE: In TP3, such nested and()/or() pattern matching is
> expressed using match() where the root grouping is assumed to be and()’d
> together.
>

Yep.



>   *** SIDENOTE: In TP4, I want to get rid of an explicit match() bytecode
> instruction and replace it with and()/or() instructions with prefix/suffix
> as()s.
>

Hmm. I think the match() syntax is useful, even if you can build match()
expressions out of and() and or(). Or maybe we just point users to
OpenCypher if they want conjunctive query patterns. Jeremy Hanna and I
chatted about this at the conference earlier this week... it is really just
a matter of providing the best syntactic sugar. You CAN do everything that
match() or OpenCypher can do in Gremlin, but this is not to say you always
SHOULD.



>    [...]

> Or other tuples, or tagged values. E.g. any edge projects to two vertices,
> > which are (trivial) tuples as opposed to primitive values.
>
> Good point. I started to do some modeling and I’ve been getting some good
> mileage from a new “pointer” primitive. Assume every N-Tuple has a unique ID


Minor point: it's also OK to have tuples with no id; not everything needs
to be an entity.



> (outside the data models id space). If so, the TinkerPop toy graph as
> N-Tuples is:
>
> [0][id:1,name:marko,age:29,created:*1,knows:*2]
> [1][0:*3]
> [2][0:*4,1:*5]
> [3][id:3,name:lop,lang:java]
> [4][id:2,name:vadas,age:27]
> [5][id:4,name:josh,age:32,created*:…]
>

I don't think we are quite aligned on what belongs in a tuple (which you
note below). I would rewrite these as:

{id=1, label=person}
{id=4, label=person}
{id=8, label=knows, out=1, in=4}
{id=13, label=name, out=1, in="marko"}

etc., where just for readability, we are putting id and label in the same
namespace as the fields of the tuple.



> I know you are thinking that vertices don’t have “outE” projections so
> this isn’t inline with your thinking.


True...



> However, check this out. If we assume that pointers are automatically
> dereferenced on reference then:
>
> db().has(‘name’,’marko’).values(‘knows’).values(‘name’) => vadas, josh
>

This is more compact than something like:

  value("marko").select("name", "in").project("out").select("knows",
"out").project("in").select("name", "out").project("in")

 but that is what I see as the most low-level bytecode for your expression.
I see your has() as sugar.


Pointers are useful when a tuple has another tuple as a value. Instead of
> nesting, you “blank node.” DocumentDBs (with nested list/maps) would use
> this extensively.
>

I agree, although in the case of lists/maps, again you don't always need a
"pointer" because the value doesn't always need to be an entity. You are
actually making explicit here that all tuples are to be treated as
entities, which is OK -- RDF does it -- but IMO not necessary. If you just
want a list of 100 people, you don't care whether the list notes have a
unique identity in the graph; in fact, it may be more efficient not to give
them any.



> > Grumble... db() is just an alias for select()... grumble…
>
> select() and project() are existing instructions in TP3 (TP4?).
>

Kinda.



> Like indices, I don’t think we should introduce types. But this is up for
> further discussion...
>

Let's keep discussing. With types comes a lot of opportunity for static
analysis and optimization, not to mention functional pattern matching.



> > Which is to say that we define the out-type of "name" to be the disjoint
> > union of all element types. The type becomes trivial. However, we can
> also
> > be more selective if we want to, restricting "name" only to a small
> subset
> > of types.
>
> Hm… I’m listening. I’m running into problems in my modeling when trying to
> generically fit things into relational tables. Maybe typing is necessary :(.
>

Types are your friend, and union types let you keep your friend at arm's
distance if you want to.



> [...]
> > I think we will see steps like V() and R() in Gremlin, but do not need
> them
> > in bytecode. Again, db() is just select(), V() is just select(), etc. The
> > model-specific interfaces adapt V() to select() etc.
>
> Hm. See my points above. Having providers reason at the “universal
> model”-level seems intense. ?
>

I have warmed up to the idea of provider (model-) specific instructions,
which compile down to universal bytecode when providers are done with their
optimizations.



>
>
> >    select(foafPerson)
> >
> > The second expression becomes:
> >
> >    value("marko").select(foafName, "in").project("out”)
> > ...which you can rewrite with has(); I just think the above is clear
> w.r.t.
> > low-level operations. The value() is just providing a start of "marko",
> > which is a string value. No need for xsd:string if we have a deep mapping
> > between RDF and APG.
>
>
> Hm… I see your type “slots” model and fear the global typing in a
> (potentially) schemaless world. For me, everything should be standard
> has()/values() TP bytecode off of an “get all” db()…  ? However, I’m open
> to seeing examples that demonstrate easier reasoning.
>

I agree that we want loosely-typed steps as well, but this is where "sugar"
like has() comes in. E.g. my

value("marko").select("name", "in").project("out")


or your

db().has("name", "marko")


both hide the fact that "name" is a property type which allows either a
person or a project as its out-vertex. However, if we do have that type
defined, then we can do some inference. E.g. we know that the result of
this traversal is a collection of either people or projects, and now we can
do smarter things if the traversal is composed together with other
traversals. The type system should be opt-in.
\


> [...]
> Yes. That is the whole point of this rabbit hole!
>         * any query language -> universal model -> any data model.
>

+1



> Vendor instructions are crucial to allow the vendor to interact with their
> database’s custom optimizations.
>

I'm sold.



> [...]
> I really don’t think so. There are too many variations to indexing.


Ah, but I would claim that all of them can be boiled down to algebraic data
types. Give me an example of an index, and I will give you the type. For
example, consider a geospatial index that resolves a (lat, lon) point to a
collection of nearby Restaurant vertices: (double × double × Restaurant),
where we resolve keys to values in the index from left to right. Give me a
double, and I will give you a (double × Restaurant) index. Give me another
double, and I will give you restaurants. The ordering of elements returned
by the index is not part of the type system; it is up to the vendor.

How about a vertex-centric index of Trips on drop-off time? That's even
simpler: (long × Trip). Again, give me a long, and I will give you a
collection of trips.

Let me know if/when I have convinced you that this not a "rat's nest" and
provides further opportunities for optimization.



> [...]
> I do like GMachine too. But, I think TP4 VM is best for now.
>

Sure. I thought you were suggesting a new name "Database Virtual Machine",
which would put us in congested air space.


I don’t think graph should be front-and-center. Graph is just another data
> model much like RDF, Document, Relational, etc.


Yes, but instead of saying "we are not only graphs", you can say "these
other things can also be treated as graphs", which is more exciting to me,
because I like graphs.



> In fact, “graph” will have numerous flavors:
>
>         Graph w/ multi-properties, meta-properties, vertex multi-labels, …
>                 - All captured in the pg/ interfaces. How exactly, not
> sure.
>

Agreed.



> Awesome stuff. Excited to receive your response.
>

Might not get through all of your emails today, but I will get to them. If
I were to starting implementing the type system, would you be open to Scala
as an implementation language?

Josh




>
> Marko.
>
> http://rredux.com <http://rredux.com/>
>
>

Re: The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Posted by Marko Rodriguez <ok...@gmail.com>.
Hey Josh,


> One more thing is needed: disjoint unions. I described these in my email on
> algebraic property graphs. They are the "plus" operator to complement the
> "times" operator in our type algebra. A disjoint union type is just like a
> tuple type, but instead of having values for field a AND field b AND field
> c, an instance of a union type has a value for field a XOR field b XOR
> field c. Let me know if you are not completely sold on union types, and I
> will provide additional motivation.

Huh. That is an interesting concept. Can you please provide examples?

>> The instructions:
>>        1. relations can be “queried” for matching tuples.
>> 
> 
> Yes.

One thing I want to stress. The “universal bytecode” is just standard [op,arg*]* bytecode save that data access is via the “universal model's" db() instruction. Thus, AND/OR/pattern matching/etc. is all available. Likewise union(), repeat(), coalesce(), choose(), etc. are all available.

db().and(as('a').values('knows').as('b'),
         or(as('a').has('name','marko'),
            as('a').values(‘created').count().is(gt(1))),
         as('b').values(’created').as('c')).
     path(‘c')

As you can see, and()/or() pattern matching is possible and can be nested.
  *** SIDENOTE: In TP3, such nested and()/or() pattern matching is expressed using match() where the root grouping is assumed to be and()’d together.
  *** SIDENOTE: In TP4, I want to get rid of an explicit match() bytecode instruction and replace it with and()/or() instructions with prefix/suffix as()s.
  *** SIDENOTE: In TP4, in general, any nested bytecode that starts with as(x) is path(x) and any bytecode that ends with as(y) is where(eq(path(y)).

> 
>>        2. tuple values can be projected out to yield primitives.
>> 
> 
> Or other tuples, or tagged values. E.g. any edge projects to two vertices,
> which are (trivial) tuples as opposed to primitive values.

Good point. I started to do some modeling and I’ve been getting some good mileage from a new “pointer” primitive. Assume every N-Tuple has a unique ID (outside the data models id space). If so, the TinkerPop toy graph as N-Tuples is:

[0][id:1,name:marko,age:29,created:*1,knows:*2]
[1][0:*3]
[2][0:*4,1:*5]
[3][id:3,name:lop,lang:java]
[4][id:2,name:vadas,age:27]
[5][id:4,name:josh,age:32,created*:…]

I know you are thinking that vertices don’t have “outE” projections so this isn’t inline with your thinking. However, check this out. If we assume that pointers are automatically dereferenced on reference then:

db().has(‘name’,’marko’).values(‘knows’).values(‘name’) => vadas, josh

Pointers are useful when a tuple has another tuple as a value. Instead of nesting, you “blank node.” DocumentDBs (with nested list/maps) would use this extensively.

> Grumble... db() is just an alias for select()... grumble…

select() and project() are existing instructions in TP3 (TP4?).

	SELECT
	db() will iterate all N-Tuples
	has() will filter out those N-Tuples with respective key/values.
	and()/or() are used for nested pattern matching.
	
	PROJECT
	values() will project out the n-tuple values.

> Here, we are kind of mixing fields with property keys. Yes,
> db().has('name', 'marko') can be used to search for elements of any type...
> if that type agrees with the out-type of the "name" relation. In my
> TinkerPop Classic example, the out type of "name" is (Person OR Project),
> so your query will get you people or projects.

Like indices, I don’t think we should introduce types. But this is up for further discussion...

> Which is to say that we define the out-type of "name" to be the disjoint
> union of all element types. The type becomes trivial. However, we can also
> be more selective if we want to, restricting "name" only to a small subset
> of types.

Hm… I’m listening. I’m running into problems in my modeling when trying to generically fit things into relational tables. Maybe typing is necessary :(.


> Good idea. TP4 can provide several "flavors" of interfaces, each of which
> is idiomatic for each major class of database provider. Meeting the
> providers halfway will make integration that much easier.

Yes. With respects to graphdb providers, they want to think in terms of Vertex/Edges/etc. We want to put the bytecode in their language so:

	1. It is easier for them to write custom strategies.
	2. inV() can operate on their Vertex object without them having to implement inV().
		*** Basically just like TP3 is now. GraphDB providers implement Graph/Vertex/Edge and everything works! However, they will then want to write custom instructions/strategies to do use their databases optimizations such as vertex-centric indices for outE(‘knows’).has(‘stars’,gt(3)).inV().


> I think we will see steps like V() and R() in Gremlin, but do not need them
> in bytecode. Again, db() is just select(), V() is just select(), etc. The
> model-specific interfaces adapt V() to select() etc.

Hm. See my points above. Having providers reason at the “universal model”-level seems intense. ?


>    select(foafPerson)
> 
> The second expression becomes:
> 
>    value("marko").select(foafName, "in").project("out”)
> ...which you can rewrite with has(); I just think the above is clear w.r.t.
> low-level operations. The value() is just providing a start of "marko",
> which is a string value. No need for xsd:string if we have a deep mapping
> between RDF and APG.


Hm… I see your type “slots” model and fear the global typing in a (potentially) schemaless world. For me, everything should be standard has()/values() TP bytecode off of an “get all” db()…  ? However, I’m open to seeing examples that demonstrate easier reasoning.

Here are some examples I’ve been playing with using db()/has()/values() over DocumentDB data:
	https://gist.github.com/okram/764033e215906787217bc3176bb3bb15 <https://gist.github.com/okram/764033e215906787217bc3176bb3bb15>
> Yes, nice. We can even take things a step further and decouple the query
> language from the database. Have a property graph database, but want to
> evaluate SPARQL? No problem. Have a relational database but want to do
> Gremlin traversal? No worries.

Yes. That is the whole point of this rabbit hole!
	* any query language -> universal model -> any data model.

> Not sure about vendor-specific instructions; a
> lot can be done in the mapping of relations to instructions which live
> entirely within black box of the vendor code.

Vendor instructions are crucial to allow the vendor to interact with their database’s custom optimizations.

V().has(‘name’,’marko’) 
  => jg:v-index(‘name’,’marko’)
outE(‘knows’).has(‘stars’,gt(3)).inV()
  => jg:vcentric-index-out(‘knows’,’stars’,gt(3))

However, I would like to understand (via examples) what you are talking about as that sounds super interesting!


> Back to indexes. IMO there should be a vendor-neutral API. Even extremely
> vendor-specific indexes like geotemporal indexes could be exposed through a
> common API, e.g.
> 
>    select("Dropoffs", {lat:37.7740, lon:122.4149, time:1556899302149})
> 
> which resolves to a vendor-specific index.

I really don’t think so. There are too many variations to indexing. TP doesn’t need to go down that rats nest. That is what vendor-specific strategies/instructions are for — let them decide how to fold has().has().has() into a single index lookup. We see everything as linear scans.

> I actually like your term "GMachine", and I don't think it's a bad idea to
> keep "graph" front and center. Yes, TP4 shall have the flexibility to
> interoperate with a variety of non-graph databases, but what it adds is a
> unifying graph abstraction.

I do like GMachine too. But, I think TP4 VM is best for now.

I don’t think graph should be front-and-center. Graph is just another data model much like RDF, Document, Relational, etc. In fact, “graph” will have numerous flavors:

	Graph w/ multi-properties, meta-properties, vertex multi-labels, …
		- All captured in the pg/ interfaces. How exactly, not sure.


Awesome stuff. Excited to receive your response.

Marko.

http://rredux.com <http://rredux.com/>


Re: The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Posted by Joshua Shinavier <jo...@fortytwo.net>.
Hi Marko,

Thanks for the detailed emails. Responses inline.


On Thu, May 2, 2019 at 6:40 AM Marko Rodriguez <ok...@gmail.com> wrote:

> [...]
> Thus, there exists a data model that can describe these database
> structures in a database agnostic manner.
>         - not in terms of tables, vertices, JSON, column families, …
>

100% with you on this.



> While we call this a “universal model” it is NOT more “general”
> (theoretically powerful) than any other database structure.
>

I agree. We should be trying harder to find equivalences, as opposed to
introducing a "bigger, better, brand-new shiny" data model.



> Reasons for creating a “universal model”.
>
>         1. To have a reduced set of objects for the TP4 VM to consider.
>                 - edges are just vertices with one incoming and outgoing
> “edge.”
>

Kinda. Let's say edges are elements with two fields. Vertices are elements
with no fields.



>                 - a column family is just a “map” of rows which are just
> “maps.”
>

Kinda. Let's say a table / column family is a data type with a number of
fields. Equivalently, it is a relation with a number of columns. You
brought up a good point in your previous email w.r.t. "person" vs.
"people", but that's why mappings are needed. A trivial schema mapping
gives you an element type "person" from a relation/table "people" and vice
versa. The table and the type are equivalent.



>                 - tables are just groupings of schema-equivalent rows.
>

Agreed. The "universal model" just makes an element out of each row.



>         2. To have a limited set of instructions in the TP4 bytecode
> specification.
>                 - outE/inE/outV/inV are just following direct “links”
> between objects.
>

inV and outV, yes, because they are fields of an edge element. outE and inE
are different, because they are not fields of the vertex. However, they are
functions. You can put them in the same namespace as inV and outV if you
want to; just keep in mind that in terms of relational algebra, they are a
fundamentally different operation.



>                 - has(), values(), keys(), valueMap(), etc. need not just
> apply to vertices and edges.
>

Agreed.



>         3. To have a simple serialization format.
>                 - we do not want to ship around
> rows/vertices/edges/documents/columns/etc.
>                 - we want to make it easy for other languages to integrate
> with the TP4 VM.
>                 - we want to make it easy to create TP4 VMs in other
> languages.
>

What is easier than a table? Any finite graph in this model is just a
collection of tables which can be shipped around as CSVs, among other
formats.



>         4. To have a theoretical understanding of the relationship between
> the various data structures.
>                 - “this is just a that” is useful to limit the
> complexities of our codebase and explain to the public how different
> database relate.
>

Yes.



> [...]
> The objects:
>         1. primitives: floats, doubles, Strings, ints, etc.
>

Yes.



>         2. tuples: key’d collections of primitives. (instances)
>         3. relations: groupings of tuples with ?equivalent? schemas.
> (types)
>

These are the same thing. A tuple is a row, is an element. A relation is a
set of elements/tuples/rows of the same type.

One more thing is needed: disjoint unions. I described these in my email on
algebraic property graphs. They are the "plus" operator to complement the
"times" operator in our type algebra. A disjoint union type is just like a
tuple type, but instead of having values for field a AND field b AND field
c, an instance of a union type has a value for field a XOR field b XOR
field c. Let me know if you are not completely sold on union types, and I
will provide additional motivation.



> The instructions:
>         1. relations can be “queried” for matching tuples.
>

Yes.



>         2. tuple values can be projected out to yield primitives.
>

Or other tuples, or tagged values. E.g. any edge projects to two vertices,
which are (trivial) tuples as opposed to primitive values.


Lets do a “traversal” from marko to the people he knows.
>
> // g.V().has(‘name’,’marko’).outE(‘knows’).inV().values(‘name’)
>
> db(‘person’).has(‘name’,’marko’).as(‘x’).
> db(‘knows’).has(‘#outV’, path(‘x’).by(‘#id’)).as(‘y’).
> db(‘person’).has(‘#id’, path(‘y’).by(‘#inV’)).
>   values(‘name’)
>

I still don't think we need the "db" step, but I think that syntax works --
you are distinguishing between fields and higher-order things like
properties by using hash characters for the field names.



> While the above is a single stream of processing, I will state what each
> line above has at that point in the stream.
>         - [#label:person,name:marko,age:29]
>

Keeping in mind that "name" and "age" are property keys as opposed to
fields, yes.



>         - [#label:knows,#outV:1,#inV:2,weight:0.5], ...
>         - [#label:person,name:vadas,age:27], ...
>         - vadas, ...
>

OK.



> Databases strategies can be smart to realize that only the #id or #inV or
> #outV of the previous object is required and thus, limit what is actually
> accessed and flow’d through the processing engine.
>         - [#id:1]
>         - [#id:0,#inV:2] …
>         - [#id:2,name:vadas] …
>         - vadas, ...
>

OK.



> *** More on such compiler optimizations (called strategies) later ***
>
> POSITIVE NOTES:
>
>         1. All relations are ‘siblings’ accessed via db().
>

Grumble... db() is just an alias for select()... grumble...



>                 - There is no concept of nesting data. A very flat
> structure.
>

Agreed.



>         2. All subsequent has()/where()/is()/etc.-filter steps after db()
> define the pattern match query.
>                 - It is completely up to the database to determine how to
> retrieve matching tuples.
>                 - For example: using indices, pointer chasing, linear
> scans w/ filter, etc.
>

And yet I think we should make certain indices explicit, as motivated in an
earlier email. That lets us do a certain amount of query optimization at
the TP level, as opposed to leaving all optimizations to the underlying
database.



>         3. All subsequent map()/flatmap()/etc. steps are projections of
> data in the tuple.
>                 - The database returns key’d tuples composed of primitives.
>                 - Primitive data can be accessed and further processed.
> (projections)
>

Cool.



>         4. The bytecode describes a computation that is irrespective of
> the underlying database’s encoding of that structure.
>                 - Amazon Neptune, MySQL, Cassandra, Spark, Hadoop, Ignite,
> etc. can be fed the same bytecode and will yield the same result.
>                 - In other words, given the example above. all databases
> can now process property graph traversals.
>

+1



> NEGATIVE NOTES:
>
>         1. Every database has to have a concept of grouping similar tuples.
>

Sure. That is the same as saying that every database needs to respect a
type system.



>         2. It implies an a priori definition of types (at least their
> existence and how to map data to them).
>

It does.



>         3. It implies a particular type of data model even though its
> represented using the “universal model."
>                 - the example above is a “property graph query” because of
> #outV, #inV, etc. schema’d keys.
>                 - the above example is a “vertex/edge-labeled property
> graph query”  because ‘person’ and ‘knows’ relations.
>                 - the above example implies that keys are unique to
> relations. (e.g. name=marko — why db(‘person’)?)
>                         - though db().has(‘name’,’marko’) can be used to
> search all relations.
>

Here, we are kind of mixing fields with property keys. Yes,
db().has('name', 'marko') can be used to search for elements of any type...
if that type agrees with the out-type of the "name" relation. In my
TinkerPop Classic example, the out type of "name" is (Person OR Project),
so your query will get you people or projects.



>         4. It requires the use of path()-data.
>                 - though we could come up with an efficient
> traverser.last() which returns the previous object touched.
>                 - However, for multi-db() relation matches, as().path()
> will have to be used.
>                         - This can be optimized out by property graph
> databases as they support pointer chasing. (** more on this later **)
>

I don't see how it requires path() data, but adhering to a type system will
mean that we have well-typed paths.


We can relax ‘apriori’-typing to enable ’name’=‘marko’ to be in any
> relation group, not just people relations. Also, lets use the concept of
> last() from (4).
>

Which is to say that we define the out-type of "name" to be the disjoint
union of all element types. The type becomes trivial. However, we can also
be more selective if we want to, restricting "name" only to a small subset
of types.



> [...]
> We can make typing completely dynamic and thus, relation groups don’t
> exist in the “universal model.” Thus, databases don’t have to even have a
> concept of groups of relations. However, databases can have relation groups
> via “indices" on #type, #type+#label, etc.
>

I agree that many applications will want a very relaxed type system. Others
will want to be more restrictive, which makes a lot more static analysis
possible. Union types provide a spectrum between dynamic and static typing.



> [...]
> The above really states that we are dealing with an “vertex/edge-labeled
> property graph”. This is not bad, because we already had the problem of the
> existence of #inV/#outE/etc. so this isn’t any more limiting. Next, TP4
> bytecode is starting to look like SPARQL pattern matching. There are tuples
> and we are matching patterns where data in some tuple equals (or general
> predicate) data in another tuple, etc. The “universal model” is just a
> sequence of key’d tuples with variable keys and lengths! (like an n-tuple
> store).
>

Yes indeed.


[...]
> All integrating database providers must support the “universal model" db()
> instruction. Its easy to implement, but is inefficient because bytecode
> using that instruction require a bunch of back-and-forths of data from DB
> to TP4 VM. Thus, TP4 will provide strategies to map db().filter()*-bytecode
> (i.e. universal model instructions) to instructions that respect their
> native structure.
>

If you mean that we will want to push operations down to the DB when
possible (for example, SQL queries for a relational database, SPARQL
queries for an RDF triple store with its own optimizations, etc.) then I
agree.



> Every database provider implements the TP4 interfaces that captures their
> native database encoding.
>         - For example, RDBMS:
> https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdbms
> <
> https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdbms
> >
>         - For example, Property Graph:
> https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/graph
> <
> https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/graph
> >
>         - For example, RDF:
> https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdf
> <
> https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdf
> >
>         - For example, Wide-Column…
>         - For example, Document…
>         - For example, HyperGraph…
>         - etc.
>

Good idea. TP4 can provide several "flavors" of interfaces, each of which
is idiomatic for each major class of database provider. Meeting the
providers halfway will make integration that much easier.



> TP4 will have lots of these interface packages (which will also include
> compiler strategies and instructions).
>

+1



> The db()-filter()* “universal model” bytecode is submitted to the TP4 VM.
> The TP4 VM looks at the integrated databases’ native structure (according
> to the interfaces it implements) and rewrites all db().filter()*-aspects of
> the submitted bytecode to a database-type specific instruction set that:
>         1. respects the semantics of the the underlying database encoding.
>         2. respects the semantics of TP4’s stream processing (i.e.
> linear/nested functions)
> For example, the previous “universal model" bytecode is rewritten for each
> database type as:
>
> Property graphs:
>         pg:V().has(‘name’,’marko’).pg:outE(‘knows’).pg:inV().values(‘name’)
>

I think we will see steps like V() and R() in Gremlin, but do not need them
in bytecode. Again, db() is just select(), V() is just select(), etc. The
model-specific interfaces adapt V() to select() etc.



> RDBMS:
>   rdbms:R(‘person’).has(‘name’,’marko’)).
>     join(rdbms:R(‘knows’)).by(’#id’,eq(‘#outV’)).
>     join(rdbms:R(‘person’)).by(‘#inV’,eq(‘#id’)).values(‘name’)
>
> RDF:
>   rdf:T().has(’p’,’rdf:type’).has(‘o’,’foaf:Person’).as(‘a’).
>
> rdf:T().has(’s’,path(‘a’).by(’s’)).has(‘p’,’foaf:name’).has(‘o’,’marko^^xsd:string’).
>   rdf:T().has(’s’,path(‘a').by(’s’)).has(‘p’,’#outE’).as(‘b’).
>
> rdf:T().has(’s’,path(‘b').by(’o’)).has(‘p’,’rdf:type’).has(‘o’,’foaf:knows’).as(‘c’).
>   rdf:T().has(’s’,path(‘c’).by(‘o’)).has(‘p’,’#inV’).as(‘d’).
>   rdf:T().has(’s’,path(‘d’).by(‘o’)).has(‘p,’rdf:name’).values(‘o’)
>


Same comments. In the case of RDF, we may even want a deeper integration of
RDF types with APG / the universal model. E.g. the first expression just
becomes:

    select(foafPerson)

The second expression becomes:

    value("marko").select(foafName, "in").project("out")

...which you can rewrite with has(); I just think the above is clear w.r.t.
low-level operations. The value() is just providing a start of "marko",
which is a string value. No need for xsd:string if we have a deep mapping
between RDF and APG.



> Next, TP4 will have strategies that can be generally applied to each
> database-type to further optimize the bytecode.
>
> Property graphs:
>         pg:V().has(‘name’,’marko’).pg:out(‘knows’).values(‘name’)
>
> RDBMS:
>         rdbms:sql(“SELECT name FROM person,knows,person WHERE p1.id=knows.inV
> …”)
>
> RDF:
>         rdf:sparql(“SELECT ?e WHERE { ?x rdf:type foaf:Person. ?x
> foaf:name marko^^xsd …”)
>

Yes, nice. We can even take things a step further and decouple the query
language from the database. Have a property graph database, but want to
evaluate SPARQL? No problem. Have a relational database but want to do
Gremlin traversal? No worries.



> Finally, vendors can then apply their custom strategies. For instance, for
> JanusGraph:
>
>
> jg:v-index(’name’,’marko’,grab(‘out-edges')).jg:out(‘knows’,grab(‘in-vertex’,’name-property').values(‘name’)
>

Sure.



> * The “universal model” instruction set must be supported by every
> database type. [all databases]
>

+1



> * The database-type specific instructions (e.g. V(), sparql(), sql(),
> out(), etc.) are only required to be understood by databases that implement
> that type interface. [database class]
>

+ 0.5. I think the model-specific interfaces are a great idea, but that
doesn't mean we can't invoke sparql() on a non-RDF database, can't use V()
for Gremlin-style traversals over non-PG databases, etc. +1 to
vendor-specific strategies. Not sure about vendor-specific instructions; a
lot can be done in the mapping of relations to instructions which live
entirely within black box of the vendor code.



> * All  vendor-specific instructions (e.g. jg:v-index()) are only required
> to be understood by that particular database. [database instance]
>

Back to indexes. IMO there should be a vendor-neutral API. Even extremely
vendor-specific indexes like geotemporal indexes could be exposed through a
common API, e.g.

    select("Dropoffs", {lat:37.7740, lon:122.4149, time:1556899302149})

which resolves to a vendor-specific index.



> [...]
> The million dollar question:
>
>         "Why would you want to encode an X data structure into a database
> that natively supports a Y data structure?”
> [...]
>

I agree with your answers. Not much to add.



> And there you have it — I believe Apache TinkerPop is on the verge of
> offering a powerful new data(base) theory and technology.
>
>         The Database Virtual Machine
>

+1

I actually like your term "GMachine", and I don't think it's a bad idea to
keep "graph" front and center. Yes, TP4 shall have the flexibility to
interoperate with a variety of non-graph databases, but what it adds is a
unifying graph abstraction.


Josh

Re: The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Posted by Marko Rodriguez <ok...@gmail.com>.
Hey Josh (others),

I was thinking of our recent divergence in thought. I thought it would be smart for me to summarize where we are and to do my best to describe your model so as to better understand your perspective and to help you better understand how your model will ultimately execute on the TP4 VM.

############################
# WHY A UNIVERSAL MODEL? #
###########################

Every database data model can be losslessly embedded in every other database data model.
	- e.g. you can embed a property graph structure in a relational structure.
	- e.g. you can embed a document structure in a property graph structure.
	- e.g. you can embed a wide-column structure in a document structure.
	- …
	- e.g. you can embed a property graph structure in a Hadoop sequence file or Spark RDD.

Thus, there exists a data model that can describe these database structures in a database agnostic manner.
	- not in terms of tables, vertices, JSON, column families, …

While we call this a “universal model” it is NOT more “general” (theoretically powerful) than any other database structure.

Reasons for creating a “universal model”.

	1. To have a reduced set of objects for the TP4 VM to consider.
		- edges are just vertices with one incoming and outgoing “edge.”
		- a column family is just a “map” of rows which are just “maps.”
		- tables are just groupings of schema-equivalent rows.
		- …
	2. To have a limited set of instructions in the TP4 bytecode specification.
		- outE/inE/outV/inV are just following direct “links” between objects.
		- has(), values(), keys(), valueMap(), etc. need not just apply to vertices and edges.
		- …
	3. To have a simple serialization format.
		- we do not want to ship around rows/vertices/edges/documents/columns/etc.
		- we want to make it easy for other languages to integrate with the TP4 VM.
		- we want to make it easy to create TP4 VMs in other languages.
		- ...
	4. To have a theoretical understanding of the relationship between the various data structures.
		- “this is just a that” is useful to limit the complexities of our codebase and explain to the public how different database relate.

Without further ado...

########################
# THE UNIVERSAL MODEL #
########################

*** This is as I understand it. I will let Josh decide whether I captured his ideas correctly. ***
*** All subsequent x().y().z() expressions are BYTECODE, not GREMLIN (just using an easier syntax then [op,arg*]*. ***

The objects:
	1. primitives: floats, doubles, Strings, ints, etc.
	2. tuples: key’d collections of primitives. (instances)
	3. relations: groupings of tuples with ?equivalent? schemas. (types)

The instructions:
	1. relations can be “queried” for matching tuples.
	2. tuple values can be projected out to yield primitives.

Lets do a “traversal” from marko to the people he knows.

// g.V().has(‘name’,’marko’).outE(‘knows’).inV().values(‘name’)

db(‘person’).has(‘name’,’marko’).as(‘x’).
db(‘knows’).has(‘#outV’, path(‘x’).by(‘#id’)).as(‘y’).
db(‘person’).has(‘#id’, path(‘y’).by(‘#inV’)).
  values(‘name’)

While the above is a single stream of processing, I will state what each line above has at that point in the stream.
	- [#label:person,name:marko,age:29]
	- [#label:knows,#outV:1,#inV:2,weight:0.5], ...
	- [#label:person,name:vadas,age:27], ...
	- vadas, ...
Databases strategies can be smart to realize that only the #id or #inV or #outV of the previous object is required and thus, limit what is actually accessed and flow’d through the processing engine.
	- [#id:1]
	- [#id:0,#inV:2] …
	- [#id:2,name:vadas] …
	- vadas, ...
*** More on such compiler optimizations (called strategies) later ***

POSITIVE NOTES:

	1. All relations are ‘siblings’ accessed via db().
		- There is no concept of nesting data. A very flat structure.
	2. All subsequent has()/where()/is()/etc.-filter steps after db() define the pattern match query.
		- It is completely up to the database to determine how to retrieve matching tuples.
		- For example: using indices, pointer chasing, linear scans w/ filter, etc.
	3. All subsequent map()/flatmap()/etc. steps are projections of data in the tuple.
		- The database returns key’d tuples composed of primitives.
		- Primitive data can be accessed and further processed. (projections)
	4. The bytecode describes a computation that is irrespective of the underlying database’s encoding of that structure.
		- Amazon Neptune, MySQL, Cassandra, Spark, Hadoop, Ignite, etc. can be fed the same bytecode and will yield the same result.
		- In other words, given the example above. all databases can now process property graph traversals.

NEGATIVE NOTES:

	1. Every database has to have a concept of grouping similar tuples.
	2. It implies an a priori definition of types (at least their existence and how to map data to them).
	3. It implies a particular type of data model even though its represented using the “universal model."
		- the example above is a “property graph query” because of #outV, #inV, etc. schema’d keys.
		- the above example is a “vertex/edge-labeled property graph query”  because ‘person’ and ‘knows’ relations.
		- the above example implies that keys are unique to relations. (e.g. name=marko — why db(‘person’)?)
			- though db().has(‘name’,’marko’) can be used to search all relations.
	4. It requires the use of path()-data.
		- though we could come up with an efficient traverser.last() which returns the previous object touched.
		- However, for multi-db() relation matches, as().path() will have to be used.
			- This can be optimized out by property graph databases as they support pointer chasing. (** more on this later **)

We can relax ‘apriori’-typing to enable ’name’=‘marko’ to be in any relation group, not just people relations. Also, lets use the concept of last() from (4). 

// g.V().has(‘name’,’marko’).outE(‘knows’).inV().values(‘name’)

db(‘vertices’).has(‘name’,’marko’).
db(‘edges’).has(‘#label’,’knows’).has(‘#outV’, last().by(‘#id’)).
db(‘vertices’).has(‘#label’,’person’).has(‘#id’, last().by(‘#inV’)).values(‘name’)

We can make typing completely dynamic and thus, relation groups don’t exist in the “universal model.” Thus, databases don’t have to even have a concept of groups of relations. However, databases can have relation groups via “indices" on #type, #type+#label, etc.

// g.V().has(‘name’,’marko’).outE(‘knows’).inV().values(‘name’)

db().has(’#type’,’vertex’).has(‘name’,’marko’).
db().has(‘#type’,’edge’).has(‘#label’,’knows’).has(‘#outV’, last().by(‘#id’)).
db().has(‘#type’,’vertex’).has(‘#label’,’person’).has(‘#id’, last().by(‘#inV’)).values(‘name’)

The above really states that we are dealing with an “vertex/edge-labeled property graph”. This is not bad, because we already had the problem of the existence of #inV/#outE/etc. so this isn’t any more limiting. Next, TP4 bytecode is starting to look like SPARQL pattern matching. There are tuples and we are matching patterns where data in some tuple equals (or general predicate) data in another tuple, etc. The “universal model” is just a sequence of key’d tuples with variable keys and lengths! (like an n-tuple store).

#############################################
# TP4 VM EXECUTION OF THE UNIVERSAL MODEL #
#############################################

All integrating database providers must support the “universal model" db() instruction. Its easy to implement, but is inefficient because bytecode using that instruction require a bunch of back-and-forths of data from DB to TP4 VM. Thus, TP4 will provide strategies to map db().filter()*-bytecode (i.e. universal model instructions) to instructions that respect their native structure.

Every database provider implements the TP4 interfaces that captures their native database encoding.
	- For example, RDBMS: https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdbms <https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdbms>
	- For example, Property Graph: https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/graph <https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/graph>
	- For example, RDF: https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdf <https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdf>
	- For example, Wide-Column…
	- For example, Document…
	- For example, HyperGraph…
	- etc.
TP4 will have lots of these interface packages (which will also include compiler strategies and instructions).
	
The db()-filter()* “universal model” bytecode is submitted to the TP4 VM. The TP4 VM looks at the integrated databases’ native structure (according to the interfaces it implements) and rewrites all db().filter()*-aspects of the submitted bytecode to a database-type specific instruction set that:
	1. respects the semantics of the the underlying database encoding.
	2. respects the semantics of TP4’s stream processing (i.e. linear/nested functions)
For example, the previous “universal model" bytecode is rewritten for each database type as:

Property graphs:
	pg:V().has(‘name’,’marko’).pg:outE(‘knows’).pg:inV().values(‘name’)

RDBMS:
  rdbms:R(‘person’).has(‘name’,’marko’)).
    join(rdbms:R(‘knows’)).by(’#id’,eq(‘#outV’)).
    join(rdbms:R(‘person’)).by(‘#inV’,eq(‘#id’)).values(‘name’)
	
RDF:
  rdf:T().has(’p’,’rdf:type’).has(‘o’,’foaf:Person’).as(‘a’).
  rdf:T().has(’s’,path(‘a’).by(’s’)).has(‘p’,’foaf:name’).has(‘o’,’marko^^xsd:string’).
  rdf:T().has(’s’,path(‘a').by(’s’)).has(‘p’,’#outE’).as(‘b’).
  rdf:T().has(’s’,path(‘b').by(’o’)).has(‘p’,’rdf:type’).has(‘o’,’foaf:knows’).as(‘c’).
  rdf:T().has(’s’,path(‘c’).by(‘o’)).has(‘p’,’#inV’).as(‘d’).
  rdf:T().has(’s’,path(‘d’).by(‘o’)).has(‘p,’rdf:name’).values(‘o’)

Next, TP4 will have strategies that can be generally applied to each database-type to further optimize the bytecode.

Property graphs:
	pg:V().has(‘name’,’marko’).pg:out(‘knows’).values(‘name’)

RDBMS:
	rdbms:sql(“SELECT name FROM person,knows,person WHERE p1.id=knows.inV …”)
	
RDF:
	rdf:sparql(“SELECT ?e WHERE { ?x rdf:type foaf:Person. ?x foaf:name marko^^xsd …”)

Finally, vendors can then apply their custom strategies. For instance, for JanusGraph:

jg:v-index(’name’,’marko’,grab(‘out-edges')).jg:out(‘knows’,grab(‘in-vertex’,’name-property').values(‘name’)

* The “universal model” instruction set must be supported by every database type. [all databases]
* The database-type specific instructions (e.g. V(), sparql(), sql(), out(), etc.) are only required to be understood by databases that implement that type interface. [database class]
* All  vendor-specific instructions (e.g. jg:v-index()) are only required to be understood by that particular database. [database instance]

NOTES:
	1. Optimizations such as sql(), sparql(), etc. are only for bytecode fragments that can be universally optimized for that particular class of databases.
	2. Results from sql(), sparql(), etc. can be subjected to further TP4 stream processing via repeat(), union(), choose(), etc. etc.
		- unfortunately my running example wasn’t complex enough to capture this. :(
		- the more we can pull out of TP4 bytecode and put into sql(), sparql(), etc. the better.
		- however, some query languages don’t have the respective expressivity for all types of computations (e.g. looping, branching, etc.).
			- in such situations, processing moves from DB to TP4 to DB to TP4 accordingly.
	3. We have an algorithmic way of mapping databases.
		- The RDBMS query shows there is a “property graph” encoded in tables.
		- The RDF query shows that there is a “property graph” encoded in triples.

In summary:

	1. There is a universal model and a universal instruction set.
	2. Databases integrate with the TP4 VM via “native database type”-interfaces.
	3. Submitted universal bytecode is rewritten to a database-type specific bytecode that respects the native semantics of that database-type. [decoration strategies]
	4. TP4 can further strategize that bytecode to take advantage of optimizations that are universal to that database-type. [optimization strategies]
	5. The underlying database can further strategize that bytecode to take unique advantage of their custom optimization features. [provider strategies]

################################
# WHY GO TO ALL THIS TROUBLE? #
################################

The million dollar question:
	
	"Why would you want to encode an X data structure into a database that natively supports a Y data structure?”

Answer:
	1. Its not just about databases, its about data formats in general.
		- The "universal model" allows database providers easy access to OLAP processors that have a different native structure than them.
			E.g. Spark RDDs, Hadoop SequenceFiles, Beam tuples, ...
	2. In some scenarios, a Y-database is better at processing X-type data structure than the currently existing native X-databases.
		- E.g., JanusGraph is a successful graph database product that encodes a property graph in a wide-column store.
			- JanusGraph provides graph sharding, distributed read/write from OLAP processing, high-concurrency, fault tolerance, global distribution, etc.
	3. Database providers can efficiently support other data structures that are simply "constrained versions" of their native structure. 
		- E.g., Amazon Neptune can support RDF even if their native structure is Property Graph.
			- According to the “universal model,” RDF is a restriction on property graphs.
				- RDF is just no properties and URI-based identifiers.
	4. “Agnostic” data(bases) such Redis, Ignite, Spark, etc. can easily support common data structures and their respective development communities.
		- With TP4, vendors can expand their product offering into communities they are only tangentially aware of.
			- E.g. Redis can immediately “jump into” the RDF space without having background knowledge of that space.
			- E.g. Ignite can immediately “jump into” the property graph space...
			- E.g. Spark can immediately “jump into” the document space…
	5. All TP4-enabled processors automatically work over all TP4-enabled databases.
		- JanusGraph gets dynamic query routing with Akka.
		- Amazon Neptune gets multi-threaded query execution with rxJava.
		- ComosDB gets cluster-oriented OLAP query execution with Spark.
		- …
	6. Language designers that have compilers to TP4 bytecode can work with all supporting TP4 databases/processors.
		- Neo4j no longer has to convince vendors to implement Cypher.
		- Amazon doesn’t have to choose between Gremlin, SPARQL, Cypher, etc.
			- Their customers can use their favorite language.
				- Obviously, some languages are better at expressing certain computations than others (e.g. SQL over graphs is horrible).
				- Some impedance mismatch issues can arise (e.g. RDF requires URIs for ids).
		- A plethora of new languages may emerge as designers don’t have to convince vendors to support it.
			- Language designers only have to develop a compiler to TP4 bytecode.
		
And there you have it — I believe Apache TinkerPop is on the verge of offering a powerful new data(base) theory and technology.

	The Database Virtual Machine

Thanks for reading,
Marko.

http://rredux.com <http://rredux.com/>




> On Apr 30, 2019, at 4:47 PM, Marko Rodriguez <ok...@gmail.com> wrote:
> 
> Hello,
> 
>> First, the "root". While we do need context for traversals, I don't think
>> there should be a distinct kind of root for each kind of structure. Once
>> again, select(), or operations derived from select() will work just fine.
> 
> So given your example below, “root” would be db in this case. 
> db is the reference to the structure as a whole.
> Within db, substructures exist. 
> Logically, this makes sense.
> For instance, a relational database’s references don’t leak outside the RDBMs into other areas of your computer’s memory.
> And there is always one entry point into every structure — the connection. And what does that connection point to:
> 	vertices, keyspaces, databases, document collections, etc. 
> In other words, “roots.” (even the JVM has a “root” — it called the heap).
> 
>> Want the "person" table? db.select("person"). Want a sequence of vertices
>> with the label "person"? db.select("person"). What we are saying in either
>> case is "give me the 'person' relation. Don't project any specific fields;
>> just give me all the data". A relational DB and a property graph DB will
>> have different ways of supplying the relation, but in either case, it can
>> hide behind the same interface (TRelation?).
> 
> In your lexicon, for both RDBMS and graph:
> 	db.select(‘person’) is saying, select the people table (which is composed of a sequence of “person" rows)
> 	db.select(‘person’) is saying, select the person vertices (which is composed of a sequence of “person" vertices)
> …right off the bat you have the syntax-problem of people vs. person. Tables are typically named the plural of the rows. That
> doesn’t exist in graph databases as there is just one vertex set (i.e. one “table”).
> 
> In my lexicon (TP instructions)
> 	db().values(‘people’) is saying, flatten out the person rows of the people table.
> 	V().has(label,’person’) is saying, flatten out the vertex objects of the graph’s vertices and filter out non-person vertices.
> 
> Well, that is stupid, why not have the same syntax for both structures?
> Because they are different. There are no “person” relations in the classic property graph (Neo4j 1.0). There are only vertex relations with a label=person entry.
> In a relational database there are “person” relations and these are bundled into disjoint tables (i.e. relation sets — and schema constrained).
> 
> The point I’m making is that instead of trying to fit all these data structures into a strict type system that ultimately looks like
> a bunch of disjoint relational sets, lets mimic the vendor-specified semantics. Lets take these systems at their face value
> and not try and “mathematize” them. If they are inconsistent and ugly, fine. If we map them into another system that is mathematical
> and beautiful, great. However, every data structure, from Neo4j’s representation for OLTP traversals
>  to that “same" data being OLAP processed as Spark RDDs or Hadoop
> SequenceFiles will all have their ‘oh shits’ (impedance mismatches) and that is okay. As this is the reality we are tying to model!
> 
> Graph and RDBMs have two different data models (their unique worldview):
> 
> RDBMS:   Databases->Tables->Rows->Primitives
> GraphDB: Vertices->Edges->Vertices->Edges->Vertices-> ...
> 
> Here is a person->knows->person “traversal” in TP4 bytecode over an RDBMS (#key are ’symbols’ (constants)):
> 
> db().values(“people”).as(“x”).
> db().values(“knows”).as(“y”).
>   where(“x”,eq(“y”)).by(#id).by(#outV).
> db().values(“people”).as(“z”).
>   where(“y”,eq(“z”)).by(#inV).by(#id)
>    
> Pretty freakin’ disgusting, eh? Here is a person->knows->person “traversal” in TP4 bytecode over a property graph:
> 
> V().has(#label,”person”).values(#outE).has(#label,”knows”).values(#inV)
> 
> So we have two completely different bytecode representations for the same computational result. Why?
> Because we have two completely different data models!
> 
> 	One is a set of disjoint typed-relations (i.e. RDBMS).
> 	One is a set of nested loosely-typed-relations (i.e. property graphs).
> 
> Why not make them the same? Because they are not the same and that is exactly what I believe we should be capturing.
> 
> Just looking at the two computations above you see that a relational database is doing “joins” while a graph database is doing “traversals”.
> We have to use path-data to compute a join. We have to use memory! (and we do). We don’t have to use path-data to compute a traversal.
> We don’t have to use memory! (and we don’t!). That is the fundamental nature of the respective computations that are taking place.
> That is what gives each system their particular style of computing.
> 
> NEXT: There is nothing that says you can’t map between the two? Lets go property graph to RDBMS.
> 	- we could make a person table, a software table, a knows table, a created table.
> 		- that only works if the property graph is schema-based.
> 	- we could make a single vertex table with another 3 column properties table (vertexId,key,value)
> 	- we could…
> Which ever encoding you choose, a different bytecode will be required. Fortunately, the space of (reasonable) possibilities is constrained.
> Thus, instead of saying: 
> 	“I want to map from property graph to RDBMS” 
> I say: 
> 	“I want to map from a recursive, bi-relational structure to a disjoint multi-relational structure where linkage is based on #id/#outV/#inV equalities.”
> Now you have constrained the space of possible RDBMS encodings! Moreover, we now have an algorithmic solution that not only disconnects “vertices,” 
> but also rewrites the bytecode according to the new logical steps required to execute the computation as we have a new data structure and a new
> way of moving through that data structure. The pointers are completely different! However, as long as the mapping is sound, the rewrite should be algorithmic.
> 
> I’m getting tired. I see your stuff below about indices and I have thoughts on that… but I will address those tomorrow.
> 
> Thanks for reading,
> Marko.
> 
> http://rredux.com <http://rredux.com/>
> 
> 
> 
> 
> 
> 
> 
>> 
>> But wait, you say, what if the under the hood, you have a TTable in one
>> case, and TSequence in the other? They are so different! That's why
>> the Dataflow
>> Model
>> <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43864.pdf <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43864.pdf>>
>> is so great; to an extent, you can think of the two as interchangeable. I
>> think we would get a lot of mileage out of treating them as interchangeable
>> within TP4.
>> 
>> So instead of a data model -specific "root", I argue for a universal root
>> together with a set of relations and what we might call an "indexes". An
>> index is an arrow from a type to a relation which says "give me a
>> column/value pair, and I will give you all matching tuples from this
>> relation". The result is another relation. Where data sources differentiate
>> themselves is by having different relations and indexes.
>> 
>> For example, if the underlying data structure is nothing but a stream of
>> Trip tuples, you will have a single relation "Trip", and no indexes. Sorry;
>> you just have to wait for tuples to go by, and filter on them. So if you
>> say d.select("Trip", "driver") -- where d is a traversal that gets you to a
>> User -- the machine knows that it can't use "driver" to look up a specific
>> set of trips; it has to use a filter over all future "Trip" tuples. If, on
>> the other hand, we have a relational database, we have the option of
>> indexing on "driver". In this case, d.select("Trip", "driver") may take you
>> to a specific table like "Trip_by_driver" which has "driver" as a primary
>> key. The machine recognizes that this index exists, and uses it to answer
>> the query more efficiently. The alternative is to do a full scan over any
>> table which contains the "Trip" relation. Since TinkerPop3, we have been
>> without a vendor-neutral API for indexes, but this is where such an API
>> would really start to shine. Consider Neo4j's single property indexes,
>> JanusGraph's composite indexes, and even RDF triple indices (spo, ops,
>> etc.) as in AllegroGraph in addition to primary keys in relational
>> databases.
>> 
>> TTuple -- cool. +1
>> 
>> "Enums" -- I agree that enums are necessary, but we need even more: tagged
>> unions <https://en.wikipedia.org/wiki/Tagged_union <https://en.wikipedia.org/wiki/Tagged_union>>. They are part of the
>> system of algebraic data types which I described on Friday. An enum is a
>> special case of a tagged union in which there is no value, just a type tag.
>> May I suggest something like TValue, which contains a value (possibly
>> trivial) together with a type tag. This enables ORs and pattern matching.
>> For example, suppose "created" edges are allowed to point to either
>> "Project" or "Document" vertices. The in-type of "created" is
>> union{project:Project, document:Document). Now the in value of a specific
>> edge can be TValue("project", [some project vertex]) or TValue("document",
>> [some document vertex]) and you have the freedom to switch on the type tag
>> if you want to, e.g. the next step in the traversal can give you the "name"
>> of the project or the "title" of the document as appropriate.
>> 
>> Multi-properties -- agreed; has() is good enough.
>> 
>> Meta-properties -- again, this is where I think we should have a
>> lower-level select() operation. Then has() builds on that operation.
>> Whereas select() matches on fields of a relation, has() matches on property
>> values and other higher-order things. If you want properties of properties,
>> don't use has(); use select()/from(). Most of the time, you will just want
>> to use has().
>> 
>> Agreed that every *entity* should have an id(), and also a label() (though
>> it should always be possible to infer label() from the context). I would
>> suggest TEntity (or TElement), which has id(), label(), and value(), where
>> value() provides the raw value (usually a TTuple) of the entity.
>> 
>> Josh
>> 
>> 
>> 
>> On Mon, Apr 29, 2019 at 10:35 AM Marko Rodriguez <okrammarko@gmail.com <ma...@gmail.com>>
>> wrote:
>> 
>>> Hello Josh,
>>> 
>>>> A has("age",29), for example, operates at a different level of
>>> abstraction than a
>>>> has("city","Santa Fe") if "city" is a column in an "addresses" table.
>>> 
>>> So hasXXX() operators work on TTuples. Thus:
>>> 
>>> g.V().hasLabel(‘person’).has(‘age’,29)
>>> g.V().hasLabel(‘address’).has(‘city’,’Santa Fe’)
>>> 
>>> ..both work as a person-vertex and an address-vertex are TTuples. If these
>>> were tables, then:
>>> 
>>> jdbc.db().values(‘people’).has(‘age’,29)
>>> jdbc.db().values(‘addresses’).has(‘city’,’Santa Fe’)
>>> 
>>> …also works as both people and addresses are TTables which extend
>>> TTuple<String,?>.
>>> 
>>> In summary, its its a TTuple, then hasXXX() is good go.
>>> 
>>> ////////// IGNORE UNTIL AFTER READING NEXT SECTION //////////
>>> *** SIDENOTE: A TTable (which is a TSequence) could have Symbol-based
>>> metadata. Thus TTable.value(#label) -> “people.” If so, then
>>> jdbc.db().hasLabel(“people”).has(“age”,29)
>>> 
>>>> At least, they
>>>> are different if the data model allows for multi-properties,
>>>> meta-properties, and hyper-edges. A property is something that can either
>>>> be there, attached to an element, or not be there. There may also be more
>>>> than one such property, and it may have other properties attached to it.
>>> A
>>>> column of a table, on the other hand, is always there (even if its value
>>> is
>>>> allowed to be null), always has a single value, and cannot have further
>>>> properties attached.
>>> 
>>> 1. Multi-properties.
>>> 
>>> Multi-properties works because if name references a TSequence, then its
>>> the sequence that you analyze with has(). This is another reason why
>>> TSequence is important. Its a reference to a “stream” so there isn’t
>>> another layer of tuple-nesting.
>>> 
>>> // assume v[1] has name={marko,mrodriguez,markor}
>>> g.V(1).value(‘name’) => TSequence<String>
>>> g.V(1).values(‘name’) => marko, mrodriguez, markor
>>> g.V(1).has(‘name’,’marko’) => v[1]
>>> 
>>> 2. Meta-properties
>>> 
>>> // assume v[1] has name=[value:marko,creator:josh,timestamp:12303] // i.e.
>>> a tuple value
>>> g.V(1).value(‘name’) => TTuple<?,String> // doh!
>>> g.V(1).value(‘name’).value(‘value’) => marko
>>> g.V(1).value(‘name’).value(‘creator’) => josh
>>> 
>>> So things get screwy. — however, it only gets screwy when you mix your
>>> “metadata” key/values with your “data” key/values. This is why I think
>>> TSymbols are important. Imagine the following meta-property tuple for v[1]:
>>> 
>>> [#value:marko,creator:josh,timestamp:12303]
>>> 
>>> If you do g.V(1).value(‘name’), we could look to the value indexed by the
>>> symbol #value, thus => “marko”.
>>> If you do g.V(1).values(‘name’), you would get back a TSequence with a
>>> single TTuple being the meta property.
>>> If you do g.V(1).values(‘name’).value(), we could get the value indexed by
>>> the symbol #value.
>>> If you do g.V(1).values(‘name’).value(‘creator’), it will return the
>>> primitive string “josh”.
>>> 
>>> I believe that the following symbols should be recommended for use across
>>> all data structures.
>>>        #id, #label, #key, #value
>>> …where id(), label(), key(), value() are tuple.get(Symbol). Other symbols
>>> for use with propertygraph/ include:
>>>        #outE, #inV, #inE, #outV, #bothE, #bothV
>>> 
>>>> In order to simplify user queries, you can let has() and values() do
>>> double
>>>> duty, but I still feel that there are lower-level operations at play, at
>>> a
>>>> logical level even if not at a bytecode level. However, expressing the a
>>>> traversal in terms of its lowest-level relational operations may also be
>>>> useful for query optimization.
>>> 
>>> One thing that I’m doing, that perhaps you haven’t caught onto yet, is
>>> that I’m not modeling everything in terms of “tables.” Each data structure
>>> is trying to stay as pure to its conceptual model as possible. Thus, there
>>> are no “joins” in property graphs as outE() references a TSequence<TEdge>,
>>> where TEdge is an interface that extends TTuple. You can just walk without
>>> doing any type of INNER JOIN. Now, if you model a property graph in a
>>> relational database, you will have to strategize the bytecode accordingly!
>>> Just a heads up in case you haven’t noticed that.
>>> 
>>> Thanks for your input,
>>> Marko.
>>> 
>>> http://rredux.com <http://rredux.com/> <http://rredux.com/ <http://rredux.com/>>
>>> 
>>> 
>>> 
>>>> 
>>>> Josh
>>>> 
>>>> 
>>>> 
>>>> On Mon, Apr 29, 2019 at 7:34 AM Marko Rodriguez <okrammarko@gmail.com <ma...@gmail.com>
>>> <mailto:okrammarko@gmail.com <ma...@gmail.com>>>
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> *** This email is primarily for Josh (and Kuppitz). However, if others
>>> are
>>>>> interested… ***
>>>>> 
>>>>> So I did a lot of thinking this weekend about structure/ and this
>>> morning,
>>>>> I prototyped both graph/ and rdbms/.
>>>>> 
>>>>> This is the way I’m currently thinking of things:
>>>>> 
>>>>>       1. There are 4 base types in structure/.
>>>>>               - Primitive: string, long, float, int, … (will constrain
>>>>> these at some point).
>>>>>               - TTuple<K,V>: key/value map.
>>>>>               - TSequence<V>: an iterable of v objects.
>>>>>               - TSymbol: like Ruby, I think we need “enum-like” symbols
>>>>> (e.g., #id, #label).
>>>>> 
>>>>>       2. Every structure has a “root.”
>>>>>               - for graph its TGraph implements TSequence<TVertex>
>>>>>               - for rdbms its a TDatabase implements
>>>>> TTuple<String,TTable>
>>>>> 
>>>>>       3. Roots implement Structure and thus, are what is generated by
>>>>> StructureFactory.mint().
>>>>>               - defined using withStructure().
>>>>>               - For graph, its accessible via V().
>>>>>               - For rdbms, its accessible via db().
>>>>> 
>>>>>       4. There is a list of core instructions for dealing with these
>>>>> base objects.
>>>>>               - value(K key): gets the TTuple value for the provided
>>> key.
>>>>>               - values(K key): gets an iterator of the value for the
>>>>> provided key.
>>>>>               - entries(): gets an iterator of T2Tuple objects for the
>>>>> incoming TTuple.
>>>>>               - hasXXX(A,B): various has()-based filters for looking
>>>>> into a TTuple and a TSequence
>>>>>               - db()/V()/etc.: jump to the “root” of the
>>> withStructure()
>>>>> structure.
>>>>>               - drop()/add(): behave as one would expect and thus.
>>>>> 
>>>>> ————
>>>>> 
>>>>> For RDBMS, we have three interfaces in rdbms/.
>>>>> (machine/machine-core/structure/rdbms)
>>>>> 
>>>>>       1. TDatabase implements TTuple<String,TTable> // the root
>>>>> structure that indexes the tables.
>>>>>       2. TTable implements TSequence<TRow<?>> // a table is a sequence
>>>>> of rows
>>>>>       3. TRow<V> implements TTuple<String,V>> // a row has string
>>> column
>>>>> names
>>>>> 
>>>>> I then created a new project at machine/structure/jdbc). The classes in
>>>>> here implement the above rdbms/ interfaces/
>>>>> 
>>>>> Here is an RDBMS session:
>>>>> 
>>>>> final Machine machine = LocalMachine.open();
>>>>> final TraversalSource jdbc =
>>>>>       Gremlin.traversal(machine).
>>>>>                       withProcessor(PipesProcessor.class).
>>>>>                       withStructure(JDBCStructure.class,
>>>>> Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
>>>>> 
>>>>> System.out.println(jdbc.db().toList());
>>>>> System.out.println(jdbc.db().entries().toList());
>>>>> System.out.println(jdbc.db().value("people").toList());
>>>>> System.out.println(jdbc.db().values("people").toList());
>>>>> System.out.println(jdbc.db().values("people").value("name").toList());
>>>>> System.out.println(jdbc.db().values("people").entries().toList());
>>>>> 
>>>>> This yields:
>>>>> 
>>>>> [<database#conn1: url=jdbc:h2:/tmp/test user=>]
>>>>> [PEOPLE:<table#PEOPLE>]
>>>>> [<table#people>]
>>>>> [<row#PEOPLE:1>, <row#PEOPLE:2>]
>>>>> [marko, josh]
>>>>> [NAME:marko, AGE:29, NAME:josh, AGE:32]
>>>>> 
>>>>> The bytecode of the last query is:
>>>>> 
>>>>> [db(<database#conn1: url=jdbc:h2:/tmp/test user=>), values(people),
>>>>> entries]
>>>>> 
>>>>> JDBCDatabase implements TDatabase, Structure.
>>>>>       *** JDBCDatabase is the root structure and is referenced by db()
>>>>> *** (CRUCIAL POINT)
>>>>> 
>>>>> Assume another table called ADDRESSES with two columns: name and city.
>>>>> 
>>>>> 
>>>>> 
>>> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).value(“city”)
>>>>> 
>>>>> The above is equivalent to:
>>>>> 
>>>>> SELECT city FROM people,addresses WHERE people.name=addresses.name
>>>>> 
>>>>> If you want to do an inner join (a product), you do this:
>>>>> 
>>>>> 
>>>>> 
>>> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).as(“y”).path(“x”,”y")
>>>>> 
>>>>> The above is equivalent to:
>>>>> 
>>>>> SELECT * FROM addresses INNER JOIN people ON people.name=addresses.name
>>>>> 
>>>>> NOTES:
>>>>>       1. Instead of select(), we simply jump to the root via db() (or
>>>>> V() for graph).
>>>>>       2. Instead of project(), we simply use value() or values().
>>>>>       3. Instead of select() being overloaded with by() join syntax, we
>>>>> use has() and path().
>>>>>               - like TP3 we will be smart about dropping path() data
>>>>> once its no longer referenced.
>>>>>       4. We can also do LEFT and RIGHT JOINs (haven’t thought through
>>>>> FULL OUTER JOIN yet).
>>>>>               - however, we don’t support ‘null' in TP so I don’t know
>>>>> if we want to support these null-producing joins. ?
>>>>> 
>>>>> LEFT JOIN:
>>>>>       * If an address doesn’t exist for the person, emit a
>>> “null”-filled
>>>>> path.
>>>>> 
>>>>> jdbc.db().values(“people”).as(“x”).
>>>>> db().values(“addresses”).as(“y”).
>>>>>   choose(has(“name”,eq(path(“x”).by(“name”))),
>>>>>     identity(),
>>>>>     path(“y”).by(null).as(“y”)).
>>>>> path(“x”,”y")
>>>>> 
>>>>> SELECT * FROM addresses LEFT JOIN people ON people.name=addresses.name
>>>>> 
>>>>> RIGHT JOIN:
>>>>> 
>>>>> jdbc.db().values(“people”).as(“x”).
>>>>> db().values(“addresses”).as(“y”).
>>>>>   choose(has(“name”,eq(path(“x”).by(“name”))),
>>>>>     identity(),
>>>>>     path(“x”).by(null).as(“x”)).
>>>>> path(“x”,”y")
>>>>> 
>>>>> 
>>>>> SUMMARY:
>>>>> 
>>>>> There are no “low level” instructions. Everything is based on the
>>> standard
>>>>> instructions that we know and love. Finally, if not apparent, the above
>>>>> bytecode chunks would ultimately get strategized into a single SQL query
>>>>> (breadth-first) instead of one-off queries (depth-first) to improve
>>>>> performance.
>>>>> 
>>>>> Neat?,
>>>>> Marko.
>>>>> 
>>>>> http://rredux.com <http://rredux.com/> <http://rredux.com/ <http://rredux.com/>> <http://rredux.com/ <http://rredux.com/> <
>>> http://rredux.com/ <http://rredux.com/>>>
> 


Re: The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Posted by Marko Rodriguez <ok...@gmail.com>.
Hello,

> First, the "root". While we do need context for traversals, I don't think
> there should be a distinct kind of root for each kind of structure. Once
> again, select(), or operations derived from select() will work just fine.

So given your example below, “root” would be db in this case. 
db is the reference to the structure as a whole.
Within db, substructures exist. 
Logically, this makes sense.
For instance, a relational database’s references don’t leak outside the RDBMs into other areas of your computer’s memory.
And there is always one entry point into every structure — the connection. And what does that connection point to:
	vertices, keyspaces, databases, document collections, etc. 
In other words, “roots.” (even the JVM has a “root” — it called the heap).

> Want the "person" table? db.select("person"). Want a sequence of vertices
> with the label "person"? db.select("person"). What we are saying in either
> case is "give me the 'person' relation. Don't project any specific fields;
> just give me all the data". A relational DB and a property graph DB will
> have different ways of supplying the relation, but in either case, it can
> hide behind the same interface (TRelation?).

In your lexicon, for both RDBMS and graph:
	db.select(‘person’) is saying, select the people table (which is composed of a sequence of “person" rows)
	db.select(‘person’) is saying, select the person vertices (which is composed of a sequence of “person" vertices)
…right off the bat you have the syntax-problem of people vs. person. Tables are typically named the plural of the rows. That
doesn’t exist in graph databases as there is just one vertex set (i.e. one “table”).

In my lexicon (TP instructions)
	db().values(‘people’) is saying, flatten out the person rows of the people table.
	V().has(label,’person’) is saying, flatten out the vertex objects of the graph’s vertices and filter out non-person vertices.

Well, that is stupid, why not have the same syntax for both structures?
Because they are different. There are no “person” relations in the classic property graph (Neo4j 1.0). There are only vertex relations with a label=person entry.
In a relational database there are “person” relations and these are bundled into disjoint tables (i.e. relation sets — and schema constrained).

The point I’m making is that instead of trying to fit all these data structures into a strict type system that ultimately looks like
a bunch of disjoint relational sets, lets mimic the vendor-specified semantics. Lets take these systems at their face value
and not try and “mathematize” them. If they are inconsistent and ugly, fine. If we map them into another system that is mathematical
and beautiful, great. However, every data structure, from Neo4j’s representation for OLTP traversals
 to that “same" data being OLAP processed as Spark RDDs or Hadoop
SequenceFiles will all have their ‘oh shits’ (impedance mismatches) and that is okay. As this is the reality we are tying to model!

Graph and RDBMs have two different data models (their unique worldview):

RDBMS:   Databases->Tables->Rows->Primitives
GraphDB: Vertices->Edges->Vertices->Edges->Vertices-> ...

Here is a person->knows->person “traversal” in TP4 bytecode over an RDBMS (#key are ’symbols’ (constants)):

db().values(“people”).as(“x”).
db().values(“knows”).as(“y”).
  where(“x”,eq(“y”)).by(#id).by(#outV).
db().values(“people”).as(“z”).
  where(“y”,eq(“z”)).by(#inV).by(#id)
   
Pretty freakin’ disgusting, eh? Here is a person->knows->person “traversal” in TP4 bytecode over a property graph:

V().has(#label,”person”).values(#outE).has(#label,”knows”).values(#inV)

So we have two completely different bytecode representations for the same computational result. Why?
Because we have two completely different data models!

	One is a set of disjoint typed-relations (i.e. RDBMS).
	One is a set of nested loosely-typed-relations (i.e. property graphs).

Why not make them the same? Because they are not the same and that is exactly what I believe we should be capturing.

Just looking at the two computations above you see that a relational database is doing “joins” while a graph database is doing “traversals”.
We have to use path-data to compute a join. We have to use memory! (and we do). We don’t have to use path-data to compute a traversal.
We don’t have to use memory! (and we don’t!). That is the fundamental nature of the respective computations that are taking place.
That is what gives each system their particular style of computing.

NEXT: There is nothing that says you can’t map between the two? Lets go property graph to RDBMS.
	- we could make a person table, a software table, a knows table, a created table.
		- that only works if the property graph is schema-based.
	- we could make a single vertex table with another 3 column properties table (vertexId,key,value)
	- we could…
Which ever encoding you choose, a different bytecode will be required. Fortunately, the space of (reasonable) possibilities is constrained.
Thus, instead of saying: 
	“I want to map from property graph to RDBMS” 
I say: 
	“I want to map from a recursive, bi-relational structure to a disjoint multi-relational structure where linkage is based on #id/#outV/#inV equalities.”
Now you have constrained the space of possible RDBMS encodings! Moreover, we now have an algorithmic solution that not only disconnects “vertices,” 
but also rewrites the bytecode according to the new logical steps required to execute the computation as we have a new data structure and a new
way of moving through that data structure. The pointers are completely different! However, as long as the mapping is sound, the rewrite should be algorithmic.

I’m getting tired. I see your stuff below about indices and I have thoughts on that… but I will address those tomorrow.

Thanks for reading,
Marko.

http://rredux.com <http://rredux.com/>







> 
> But wait, you say, what if the under the hood, you have a TTable in one
> case, and TSequence in the other? They are so different! That's why
> the Dataflow
> Model
> <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43864.pdf <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43864.pdf>>
> is so great; to an extent, you can think of the two as interchangeable. I
> think we would get a lot of mileage out of treating them as interchangeable
> within TP4.
> 
> So instead of a data model -specific "root", I argue for a universal root
> together with a set of relations and what we might call an "indexes". An
> index is an arrow from a type to a relation which says "give me a
> column/value pair, and I will give you all matching tuples from this
> relation". The result is another relation. Where data sources differentiate
> themselves is by having different relations and indexes.
> 
> For example, if the underlying data structure is nothing but a stream of
> Trip tuples, you will have a single relation "Trip", and no indexes. Sorry;
> you just have to wait for tuples to go by, and filter on them. So if you
> say d.select("Trip", "driver") -- where d is a traversal that gets you to a
> User -- the machine knows that it can't use "driver" to look up a specific
> set of trips; it has to use a filter over all future "Trip" tuples. If, on
> the other hand, we have a relational database, we have the option of
> indexing on "driver". In this case, d.select("Trip", "driver") may take you
> to a specific table like "Trip_by_driver" which has "driver" as a primary
> key. The machine recognizes that this index exists, and uses it to answer
> the query more efficiently. The alternative is to do a full scan over any
> table which contains the "Trip" relation. Since TinkerPop3, we have been
> without a vendor-neutral API for indexes, but this is where such an API
> would really start to shine. Consider Neo4j's single property indexes,
> JanusGraph's composite indexes, and even RDF triple indices (spo, ops,
> etc.) as in AllegroGraph in addition to primary keys in relational
> databases.
> 
> TTuple -- cool. +1
> 
> "Enums" -- I agree that enums are necessary, but we need even more: tagged
> unions <https://en.wikipedia.org/wiki/Tagged_union <https://en.wikipedia.org/wiki/Tagged_union>>. They are part of the
> system of algebraic data types which I described on Friday. An enum is a
> special case of a tagged union in which there is no value, just a type tag.
> May I suggest something like TValue, which contains a value (possibly
> trivial) together with a type tag. This enables ORs and pattern matching.
> For example, suppose "created" edges are allowed to point to either
> "Project" or "Document" vertices. The in-type of "created" is
> union{project:Project, document:Document). Now the in value of a specific
> edge can be TValue("project", [some project vertex]) or TValue("document",
> [some document vertex]) and you have the freedom to switch on the type tag
> if you want to, e.g. the next step in the traversal can give you the "name"
> of the project or the "title" of the document as appropriate.
> 
> Multi-properties -- agreed; has() is good enough.
> 
> Meta-properties -- again, this is where I think we should have a
> lower-level select() operation. Then has() builds on that operation.
> Whereas select() matches on fields of a relation, has() matches on property
> values and other higher-order things. If you want properties of properties,
> don't use has(); use select()/from(). Most of the time, you will just want
> to use has().
> 
> Agreed that every *entity* should have an id(), and also a label() (though
> it should always be possible to infer label() from the context). I would
> suggest TEntity (or TElement), which has id(), label(), and value(), where
> value() provides the raw value (usually a TTuple) of the entity.
> 
> Josh
> 
> 
> 
> On Mon, Apr 29, 2019 at 10:35 AM Marko Rodriguez <okrammarko@gmail.com <ma...@gmail.com>>
> wrote:
> 
>> Hello Josh,
>> 
>>> A has("age",29), for example, operates at a different level of
>> abstraction than a
>>> has("city","Santa Fe") if "city" is a column in an "addresses" table.
>> 
>> So hasXXX() operators work on TTuples. Thus:
>> 
>> g.V().hasLabel(‘person’).has(‘age’,29)
>> g.V().hasLabel(‘address’).has(‘city’,’Santa Fe’)
>> 
>> ..both work as a person-vertex and an address-vertex are TTuples. If these
>> were tables, then:
>> 
>> jdbc.db().values(‘people’).has(‘age’,29)
>> jdbc.db().values(‘addresses’).has(‘city’,’Santa Fe’)
>> 
>> …also works as both people and addresses are TTables which extend
>> TTuple<String,?>.
>> 
>> In summary, its its a TTuple, then hasXXX() is good go.
>> 
>> ////////// IGNORE UNTIL AFTER READING NEXT SECTION //////////
>> *** SIDENOTE: A TTable (which is a TSequence) could have Symbol-based
>> metadata. Thus TTable.value(#label) -> “people.” If so, then
>> jdbc.db().hasLabel(“people”).has(“age”,29)
>> 
>>> At least, they
>>> are different if the data model allows for multi-properties,
>>> meta-properties, and hyper-edges. A property is something that can either
>>> be there, attached to an element, or not be there. There may also be more
>>> than one such property, and it may have other properties attached to it.
>> A
>>> column of a table, on the other hand, is always there (even if its value
>> is
>>> allowed to be null), always has a single value, and cannot have further
>>> properties attached.
>> 
>> 1. Multi-properties.
>> 
>> Multi-properties works because if name references a TSequence, then its
>> the sequence that you analyze with has(). This is another reason why
>> TSequence is important. Its a reference to a “stream” so there isn’t
>> another layer of tuple-nesting.
>> 
>> // assume v[1] has name={marko,mrodriguez,markor}
>> g.V(1).value(‘name’) => TSequence<String>
>> g.V(1).values(‘name’) => marko, mrodriguez, markor
>> g.V(1).has(‘name’,’marko’) => v[1]
>> 
>> 2. Meta-properties
>> 
>> // assume v[1] has name=[value:marko,creator:josh,timestamp:12303] // i.e.
>> a tuple value
>> g.V(1).value(‘name’) => TTuple<?,String> // doh!
>> g.V(1).value(‘name’).value(‘value’) => marko
>> g.V(1).value(‘name’).value(‘creator’) => josh
>> 
>> So things get screwy. — however, it only gets screwy when you mix your
>> “metadata” key/values with your “data” key/values. This is why I think
>> TSymbols are important. Imagine the following meta-property tuple for v[1]:
>> 
>> [#value:marko,creator:josh,timestamp:12303]
>> 
>> If you do g.V(1).value(‘name’), we could look to the value indexed by the
>> symbol #value, thus => “marko”.
>> If you do g.V(1).values(‘name’), you would get back a TSequence with a
>> single TTuple being the meta property.
>> If you do g.V(1).values(‘name’).value(), we could get the value indexed by
>> the symbol #value.
>> If you do g.V(1).values(‘name’).value(‘creator’), it will return the
>> primitive string “josh”.
>> 
>> I believe that the following symbols should be recommended for use across
>> all data structures.
>>        #id, #label, #key, #value
>> …where id(), label(), key(), value() are tuple.get(Symbol). Other symbols
>> for use with propertygraph/ include:
>>        #outE, #inV, #inE, #outV, #bothE, #bothV
>> 
>>> In order to simplify user queries, you can let has() and values() do
>> double
>>> duty, but I still feel that there are lower-level operations at play, at
>> a
>>> logical level even if not at a bytecode level. However, expressing the a
>>> traversal in terms of its lowest-level relational operations may also be
>>> useful for query optimization.
>> 
>> One thing that I’m doing, that perhaps you haven’t caught onto yet, is
>> that I’m not modeling everything in terms of “tables.” Each data structure
>> is trying to stay as pure to its conceptual model as possible. Thus, there
>> are no “joins” in property graphs as outE() references a TSequence<TEdge>,
>> where TEdge is an interface that extends TTuple. You can just walk without
>> doing any type of INNER JOIN. Now, if you model a property graph in a
>> relational database, you will have to strategize the bytecode accordingly!
>> Just a heads up in case you haven’t noticed that.
>> 
>> Thanks for your input,
>> Marko.
>> 
>> http://rredux.com <http://rredux.com/> <http://rredux.com/ <http://rredux.com/>>
>> 
>> 
>> 
>>> 
>>> Josh
>>> 
>>> 
>>> 
>>> On Mon, Apr 29, 2019 at 7:34 AM Marko Rodriguez <okrammarko@gmail.com <ma...@gmail.com>
>> <mailto:okrammarko@gmail.com <ma...@gmail.com>>>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> *** This email is primarily for Josh (and Kuppitz). However, if others
>> are
>>>> interested… ***
>>>> 
>>>> So I did a lot of thinking this weekend about structure/ and this
>> morning,
>>>> I prototyped both graph/ and rdbms/.
>>>> 
>>>> This is the way I’m currently thinking of things:
>>>> 
>>>>       1. There are 4 base types in structure/.
>>>>               - Primitive: string, long, float, int, … (will constrain
>>>> these at some point).
>>>>               - TTuple<K,V>: key/value map.
>>>>               - TSequence<V>: an iterable of v objects.
>>>>               - TSymbol: like Ruby, I think we need “enum-like” symbols
>>>> (e.g., #id, #label).
>>>> 
>>>>       2. Every structure has a “root.”
>>>>               - for graph its TGraph implements TSequence<TVertex>
>>>>               - for rdbms its a TDatabase implements
>>>> TTuple<String,TTable>
>>>> 
>>>>       3. Roots implement Structure and thus, are what is generated by
>>>> StructureFactory.mint().
>>>>               - defined using withStructure().
>>>>               - For graph, its accessible via V().
>>>>               - For rdbms, its accessible via db().
>>>> 
>>>>       4. There is a list of core instructions for dealing with these
>>>> base objects.
>>>>               - value(K key): gets the TTuple value for the provided
>> key.
>>>>               - values(K key): gets an iterator of the value for the
>>>> provided key.
>>>>               - entries(): gets an iterator of T2Tuple objects for the
>>>> incoming TTuple.
>>>>               - hasXXX(A,B): various has()-based filters for looking
>>>> into a TTuple and a TSequence
>>>>               - db()/V()/etc.: jump to the “root” of the
>> withStructure()
>>>> structure.
>>>>               - drop()/add(): behave as one would expect and thus.
>>>> 
>>>> ————
>>>> 
>>>> For RDBMS, we have three interfaces in rdbms/.
>>>> (machine/machine-core/structure/rdbms)
>>>> 
>>>>       1. TDatabase implements TTuple<String,TTable> // the root
>>>> structure that indexes the tables.
>>>>       2. TTable implements TSequence<TRow<?>> // a table is a sequence
>>>> of rows
>>>>       3. TRow<V> implements TTuple<String,V>> // a row has string
>> column
>>>> names
>>>> 
>>>> I then created a new project at machine/structure/jdbc). The classes in
>>>> here implement the above rdbms/ interfaces/
>>>> 
>>>> Here is an RDBMS session:
>>>> 
>>>> final Machine machine = LocalMachine.open();
>>>> final TraversalSource jdbc =
>>>>       Gremlin.traversal(machine).
>>>>                       withProcessor(PipesProcessor.class).
>>>>                       withStructure(JDBCStructure.class,
>>>> Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
>>>> 
>>>> System.out.println(jdbc.db().toList());
>>>> System.out.println(jdbc.db().entries().toList());
>>>> System.out.println(jdbc.db().value("people").toList());
>>>> System.out.println(jdbc.db().values("people").toList());
>>>> System.out.println(jdbc.db().values("people").value("name").toList());
>>>> System.out.println(jdbc.db().values("people").entries().toList());
>>>> 
>>>> This yields:
>>>> 
>>>> [<database#conn1: url=jdbc:h2:/tmp/test user=>]
>>>> [PEOPLE:<table#PEOPLE>]
>>>> [<table#people>]
>>>> [<row#PEOPLE:1>, <row#PEOPLE:2>]
>>>> [marko, josh]
>>>> [NAME:marko, AGE:29, NAME:josh, AGE:32]
>>>> 
>>>> The bytecode of the last query is:
>>>> 
>>>> [db(<database#conn1: url=jdbc:h2:/tmp/test user=>), values(people),
>>>> entries]
>>>> 
>>>> JDBCDatabase implements TDatabase, Structure.
>>>>       *** JDBCDatabase is the root structure and is referenced by db()
>>>> *** (CRUCIAL POINT)
>>>> 
>>>> Assume another table called ADDRESSES with two columns: name and city.
>>>> 
>>>> 
>>>> 
>> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).value(“city”)
>>>> 
>>>> The above is equivalent to:
>>>> 
>>>> SELECT city FROM people,addresses WHERE people.name=addresses.name
>>>> 
>>>> If you want to do an inner join (a product), you do this:
>>>> 
>>>> 
>>>> 
>> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).as(“y”).path(“x”,”y")
>>>> 
>>>> The above is equivalent to:
>>>> 
>>>> SELECT * FROM addresses INNER JOIN people ON people.name=addresses.name
>>>> 
>>>> NOTES:
>>>>       1. Instead of select(), we simply jump to the root via db() (or
>>>> V() for graph).
>>>>       2. Instead of project(), we simply use value() or values().
>>>>       3. Instead of select() being overloaded with by() join syntax, we
>>>> use has() and path().
>>>>               - like TP3 we will be smart about dropping path() data
>>>> once its no longer referenced.
>>>>       4. We can also do LEFT and RIGHT JOINs (haven’t thought through
>>>> FULL OUTER JOIN yet).
>>>>               - however, we don’t support ‘null' in TP so I don’t know
>>>> if we want to support these null-producing joins. ?
>>>> 
>>>> LEFT JOIN:
>>>>       * If an address doesn’t exist for the person, emit a
>> “null”-filled
>>>> path.
>>>> 
>>>> jdbc.db().values(“people”).as(“x”).
>>>> db().values(“addresses”).as(“y”).
>>>>   choose(has(“name”,eq(path(“x”).by(“name”))),
>>>>     identity(),
>>>>     path(“y”).by(null).as(“y”)).
>>>> path(“x”,”y")
>>>> 
>>>> SELECT * FROM addresses LEFT JOIN people ON people.name=addresses.name
>>>> 
>>>> RIGHT JOIN:
>>>> 
>>>> jdbc.db().values(“people”).as(“x”).
>>>> db().values(“addresses”).as(“y”).
>>>>   choose(has(“name”,eq(path(“x”).by(“name”))),
>>>>     identity(),
>>>>     path(“x”).by(null).as(“x”)).
>>>> path(“x”,”y")
>>>> 
>>>> 
>>>> SUMMARY:
>>>> 
>>>> There are no “low level” instructions. Everything is based on the
>> standard
>>>> instructions that we know and love. Finally, if not apparent, the above
>>>> bytecode chunks would ultimately get strategized into a single SQL query
>>>> (breadth-first) instead of one-off queries (depth-first) to improve
>>>> performance.
>>>> 
>>>> Neat?,
>>>> Marko.
>>>> 
>>>> http://rredux.com <http://rredux.com/> <http://rredux.com/ <http://rredux.com/>> <http://rredux.com/ <http://rredux.com/> <
>> http://rredux.com/ <http://rredux.com/>>>


Re: The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Posted by Joshua Shinavier <jo...@fortytwo.net>.
Hi Marko,

I like it. But I still have some constructive criticism. I think a little
more simplicity in the right places will make things like index support,
query optimization, and integration with SEDMs (someone else's data model)
that much easier in the future.

First, the "root". While we do need context for traversals, I don't think
there should be a distinct kind of root for each kind of structure. Once
again, select(), or operations derived from select() will work just fine.
Want the "person" table? db.select("person"). Want a sequence of vertices
with the label "person"? db.select("person"). What we are saying in either
case is "give me the 'person' relation. Don't project any specific fields;
just give me all the data". A relational DB and a property graph DB will
have different ways of supplying the relation, but in either case, it can
hide behind the same interface (TRelation?).

But wait, you say, what if the under the hood, you have a TTable in one
case, and TSequence in the other? They are so different! That's why
the Dataflow
Model
<https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43864.pdf>
is so great; to an extent, you can think of the two as interchangeable. I
think we would get a lot of mileage out of treating them as interchangeable
within TP4.

So instead of a data model -specific "root", I argue for a universal root
together with a set of relations and what we might call an "indexes". An
index is an arrow from a type to a relation which says "give me a
column/value pair, and I will give you all matching tuples from this
relation". The result is another relation. Where data sources differentiate
themselves is by having different relations and indexes.

For example, if the underlying data structure is nothing but a stream of
Trip tuples, you will have a single relation "Trip", and no indexes. Sorry;
you just have to wait for tuples to go by, and filter on them. So if you
say d.select("Trip", "driver") -- where d is a traversal that gets you to a
User -- the machine knows that it can't use "driver" to look up a specific
set of trips; it has to use a filter over all future "Trip" tuples. If, on
the other hand, we have a relational database, we have the option of
indexing on "driver". In this case, d.select("Trip", "driver") may take you
to a specific table like "Trip_by_driver" which has "driver" as a primary
key. The machine recognizes that this index exists, and uses it to answer
the query more efficiently. The alternative is to do a full scan over any
table which contains the "Trip" relation. Since TinkerPop3, we have been
without a vendor-neutral API for indexes, but this is where such an API
would really start to shine. Consider Neo4j's single property indexes,
JanusGraph's composite indexes, and even RDF triple indices (spo, ops,
etc.) as in AllegroGraph in addition to primary keys in relational
databases.

TTuple -- cool. +1

"Enums" -- I agree that enums are necessary, but we need even more: tagged
unions <https://en.wikipedia.org/wiki/Tagged_union>. They are part of the
system of algebraic data types which I described on Friday. An enum is a
special case of a tagged union in which there is no value, just a type tag.
May I suggest something like TValue, which contains a value (possibly
trivial) together with a type tag. This enables ORs and pattern matching.
For example, suppose "created" edges are allowed to point to either
"Project" or "Document" vertices. The in-type of "created" is
union{project:Project, document:Document). Now the in value of a specific
edge can be TValue("project", [some project vertex]) or TValue("document",
[some document vertex]) and you have the freedom to switch on the type tag
if you want to, e.g. the next step in the traversal can give you the "name"
of the project or the "title" of the document as appropriate.

Multi-properties -- agreed; has() is good enough.

Meta-properties -- again, this is where I think we should have a
lower-level select() operation. Then has() builds on that operation.
Whereas select() matches on fields of a relation, has() matches on property
values and other higher-order things. If you want properties of properties,
don't use has(); use select()/from(). Most of the time, you will just want
to use has().

Agreed that every *entity* should have an id(), and also a label() (though
it should always be possible to infer label() from the context). I would
suggest TEntity (or TElement), which has id(), label(), and value(), where
value() provides the raw value (usually a TTuple) of the entity.

Josh



On Mon, Apr 29, 2019 at 10:35 AM Marko Rodriguez <ok...@gmail.com>
wrote:

> Hello Josh,
>
> > A has("age",29), for example, operates at a different level of
> abstraction than a
> > has("city","Santa Fe") if "city" is a column in an "addresses" table.
>
> So hasXXX() operators work on TTuples. Thus:
>
> g.V().hasLabel(‘person’).has(‘age’,29)
> g.V().hasLabel(‘address’).has(‘city’,’Santa Fe’)
>
> ..both work as a person-vertex and an address-vertex are TTuples. If these
> were tables, then:
>
> jdbc.db().values(‘people’).has(‘age’,29)
> jdbc.db().values(‘addresses’).has(‘city’,’Santa Fe’)
>
> …also works as both people and addresses are TTables which extend
> TTuple<String,?>.
>
> In summary, its its a TTuple, then hasXXX() is good go.
>
> ////////// IGNORE UNTIL AFTER READING NEXT SECTION //////////
> *** SIDENOTE: A TTable (which is a TSequence) could have Symbol-based
> metadata. Thus TTable.value(#label) -> “people.” If so, then
> jdbc.db().hasLabel(“people”).has(“age”,29)
>
> > At least, they
> > are different if the data model allows for multi-properties,
> > meta-properties, and hyper-edges. A property is something that can either
> > be there, attached to an element, or not be there. There may also be more
> > than one such property, and it may have other properties attached to it.
> A
> > column of a table, on the other hand, is always there (even if its value
> is
> > allowed to be null), always has a single value, and cannot have further
> > properties attached.
>
> 1. Multi-properties.
>
> Multi-properties works because if name references a TSequence, then its
> the sequence that you analyze with has(). This is another reason why
> TSequence is important. Its a reference to a “stream” so there isn’t
> another layer of tuple-nesting.
>
> // assume v[1] has name={marko,mrodriguez,markor}
> g.V(1).value(‘name’) => TSequence<String>
> g.V(1).values(‘name’) => marko, mrodriguez, markor
> g.V(1).has(‘name’,’marko’) => v[1]
>
> 2. Meta-properties
>
> // assume v[1] has name=[value:marko,creator:josh,timestamp:12303] // i.e.
> a tuple value
> g.V(1).value(‘name’) => TTuple<?,String> // doh!
> g.V(1).value(‘name’).value(‘value’) => marko
> g.V(1).value(‘name’).value(‘creator’) => josh
>
> So things get screwy. — however, it only gets screwy when you mix your
> “metadata” key/values with your “data” key/values. This is why I think
> TSymbols are important. Imagine the following meta-property tuple for v[1]:
>
> [#value:marko,creator:josh,timestamp:12303]
>
> If you do g.V(1).value(‘name’), we could look to the value indexed by the
> symbol #value, thus => “marko”.
> If you do g.V(1).values(‘name’), you would get back a TSequence with a
> single TTuple being the meta property.
> If you do g.V(1).values(‘name’).value(), we could get the value indexed by
> the symbol #value.
> If you do g.V(1).values(‘name’).value(‘creator’), it will return the
> primitive string “josh”.
>
> I believe that the following symbols should be recommended for use across
> all data structures.
>         #id, #label, #key, #value
> …where id(), label(), key(), value() are tuple.get(Symbol). Other symbols
> for use with propertygraph/ include:
>         #outE, #inV, #inE, #outV, #bothE, #bothV
>
> > In order to simplify user queries, you can let has() and values() do
> double
> > duty, but I still feel that there are lower-level operations at play, at
> a
> > logical level even if not at a bytecode level. However, expressing the a
> > traversal in terms of its lowest-level relational operations may also be
> > useful for query optimization.
>
> One thing that I’m doing, that perhaps you haven’t caught onto yet, is
> that I’m not modeling everything in terms of “tables.” Each data structure
> is trying to stay as pure to its conceptual model as possible. Thus, there
> are no “joins” in property graphs as outE() references a TSequence<TEdge>,
> where TEdge is an interface that extends TTuple. You can just walk without
> doing any type of INNER JOIN. Now, if you model a property graph in a
> relational database, you will have to strategize the bytecode accordingly!
> Just a heads up in case you haven’t noticed that.
>
> Thanks for your input,
> Marko.
>
> http://rredux.com <http://rredux.com/>
>
>
>
> >
> > Josh
> >
> >
> >
> > On Mon, Apr 29, 2019 at 7:34 AM Marko Rodriguez <okrammarko@gmail.com
> <ma...@gmail.com>>
> > wrote:
> >
> >> Hi,
> >>
> >> *** This email is primarily for Josh (and Kuppitz). However, if others
> are
> >> interested… ***
> >>
> >> So I did a lot of thinking this weekend about structure/ and this
> morning,
> >> I prototyped both graph/ and rdbms/.
> >>
> >> This is the way I’m currently thinking of things:
> >>
> >>        1. There are 4 base types in structure/.
> >>                - Primitive: string, long, float, int, … (will constrain
> >> these at some point).
> >>                - TTuple<K,V>: key/value map.
> >>                - TSequence<V>: an iterable of v objects.
> >>                - TSymbol: like Ruby, I think we need “enum-like” symbols
> >> (e.g., #id, #label).
> >>
> >>        2. Every structure has a “root.”
> >>                - for graph its TGraph implements TSequence<TVertex>
> >>                - for rdbms its a TDatabase implements
> >> TTuple<String,TTable>
> >>
> >>        3. Roots implement Structure and thus, are what is generated by
> >> StructureFactory.mint().
> >>                - defined using withStructure().
> >>                - For graph, its accessible via V().
> >>                - For rdbms, its accessible via db().
> >>
> >>        4. There is a list of core instructions for dealing with these
> >> base objects.
> >>                - value(K key): gets the TTuple value for the provided
> key.
> >>                - values(K key): gets an iterator of the value for the
> >> provided key.
> >>                - entries(): gets an iterator of T2Tuple objects for the
> >> incoming TTuple.
> >>                - hasXXX(A,B): various has()-based filters for looking
> >> into a TTuple and a TSequence
> >>                - db()/V()/etc.: jump to the “root” of the
> withStructure()
> >> structure.
> >>                - drop()/add(): behave as one would expect and thus.
> >>
> >> ————
> >>
> >> For RDBMS, we have three interfaces in rdbms/.
> >> (machine/machine-core/structure/rdbms)
> >>
> >>        1. TDatabase implements TTuple<String,TTable> // the root
> >> structure that indexes the tables.
> >>        2. TTable implements TSequence<TRow<?>> // a table is a sequence
> >> of rows
> >>        3. TRow<V> implements TTuple<String,V>> // a row has string
> column
> >> names
> >>
> >> I then created a new project at machine/structure/jdbc). The classes in
> >> here implement the above rdbms/ interfaces/
> >>
> >> Here is an RDBMS session:
> >>
> >> final Machine machine = LocalMachine.open();
> >> final TraversalSource jdbc =
> >>        Gremlin.traversal(machine).
> >>                        withProcessor(PipesProcessor.class).
> >>                        withStructure(JDBCStructure.class,
> >> Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
> >>
> >> System.out.println(jdbc.db().toList());
> >> System.out.println(jdbc.db().entries().toList());
> >> System.out.println(jdbc.db().value("people").toList());
> >> System.out.println(jdbc.db().values("people").toList());
> >> System.out.println(jdbc.db().values("people").value("name").toList());
> >> System.out.println(jdbc.db().values("people").entries().toList());
> >>
> >> This yields:
> >>
> >> [<database#conn1: url=jdbc:h2:/tmp/test user=>]
> >> [PEOPLE:<table#PEOPLE>]
> >> [<table#people>]
> >> [<row#PEOPLE:1>, <row#PEOPLE:2>]
> >> [marko, josh]
> >> [NAME:marko, AGE:29, NAME:josh, AGE:32]
> >>
> >> The bytecode of the last query is:
> >>
> >> [db(<database#conn1: url=jdbc:h2:/tmp/test user=>), values(people),
> >> entries]
> >>
> >> JDBCDatabase implements TDatabase, Structure.
> >>        *** JDBCDatabase is the root structure and is referenced by db()
> >> *** (CRUCIAL POINT)
> >>
> >> Assume another table called ADDRESSES with two columns: name and city.
> >>
> >>
> >>
> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).value(“city”)
> >>
> >> The above is equivalent to:
> >>
> >> SELECT city FROM people,addresses WHERE people.name=addresses.name
> >>
> >> If you want to do an inner join (a product), you do this:
> >>
> >>
> >>
> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).as(“y”).path(“x”,”y")
> >>
> >> The above is equivalent to:
> >>
> >> SELECT * FROM addresses INNER JOIN people ON people.name=addresses.name
> >>
> >> NOTES:
> >>        1. Instead of select(), we simply jump to the root via db() (or
> >> V() for graph).
> >>        2. Instead of project(), we simply use value() or values().
> >>        3. Instead of select() being overloaded with by() join syntax, we
> >> use has() and path().
> >>                - like TP3 we will be smart about dropping path() data
> >> once its no longer referenced.
> >>        4. We can also do LEFT and RIGHT JOINs (haven’t thought through
> >> FULL OUTER JOIN yet).
> >>                - however, we don’t support ‘null' in TP so I don’t know
> >> if we want to support these null-producing joins. ?
> >>
> >> LEFT JOIN:
> >>        * If an address doesn’t exist for the person, emit a
> “null”-filled
> >> path.
> >>
> >> jdbc.db().values(“people”).as(“x”).
> >>  db().values(“addresses”).as(“y”).
> >>    choose(has(“name”,eq(path(“x”).by(“name”))),
> >>      identity(),
> >>      path(“y”).by(null).as(“y”)).
> >>  path(“x”,”y")
> >>
> >> SELECT * FROM addresses LEFT JOIN people ON people.name=addresses.name
> >>
> >> RIGHT JOIN:
> >>
> >> jdbc.db().values(“people”).as(“x”).
> >>  db().values(“addresses”).as(“y”).
> >>    choose(has(“name”,eq(path(“x”).by(“name”))),
> >>      identity(),
> >>      path(“x”).by(null).as(“x”)).
> >>  path(“x”,”y")
> >>
> >>
> >> SUMMARY:
> >>
> >> There are no “low level” instructions. Everything is based on the
> standard
> >> instructions that we know and love. Finally, if not apparent, the above
> >> bytecode chunks would ultimately get strategized into a single SQL query
> >> (breadth-first) instead of one-off queries (depth-first) to improve
> >> performance.
> >>
> >> Neat?,
> >> Marko.
> >>
> >> http://rredux.com <http://rredux.com/> <http://rredux.com/ <
> http://rredux.com/>>
>
>

Re: The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Posted by Marko Rodriguez <ok...@gmail.com>.
Hello Josh,

> A has("age",29), for example, operates at a different level of abstraction than a
> has("city","Santa Fe") if "city" is a column in an "addresses" table.

So hasXXX() operators work on TTuples. Thus:

g.V().hasLabel(‘person’).has(‘age’,29)
g.V().hasLabel(‘address’).has(‘city’,’Santa Fe’)

..both work as a person-vertex and an address-vertex are TTuples. If these were tables, then:

jdbc.db().values(‘people’).has(‘age’,29)
jdbc.db().values(‘addresses’).has(‘city’,’Santa Fe’)

…also works as both people and addresses are TTables which extend TTuple<String,?>.

In summary, its its a TTuple, then hasXXX() is good go.

////////// IGNORE UNTIL AFTER READING NEXT SECTION //////////
*** SIDENOTE: A TTable (which is a TSequence) could have Symbol-based metadata. Thus TTable.value(#label) -> “people.” If so, then
jdbc.db().hasLabel(“people”).has(“age”,29)

> At least, they
> are different if the data model allows for multi-properties,
> meta-properties, and hyper-edges. A property is something that can either
> be there, attached to an element, or not be there. There may also be more
> than one such property, and it may have other properties attached to it. A
> column of a table, on the other hand, is always there (even if its value is
> allowed to be null), always has a single value, and cannot have further
> properties attached.

1. Multi-properties.

Multi-properties works because if name references a TSequence, then its the sequence that you analyze with has(). This is another reason why TSequence is important. Its a reference to a “stream” so there isn’t another layer of tuple-nesting.

// assume v[1] has name={marko,mrodriguez,markor}
g.V(1).value(‘name’) => TSequence<String>
g.V(1).values(‘name’) => marko, mrodriguez, markor
g.V(1).has(‘name’,’marko’) => v[1]

2. Meta-properties

// assume v[1] has name=[value:marko,creator:josh,timestamp:12303] // i.e. a tuple value
g.V(1).value(‘name’) => TTuple<?,String> // doh!
g.V(1).value(‘name’).value(‘value’) => marko
g.V(1).value(‘name’).value(‘creator’) => josh

So things get screwy. — however, it only gets screwy when you mix your “metadata” key/values with your “data” key/values. This is why I think TSymbols are important. Imagine the following meta-property tuple for v[1]:

[#value:marko,creator:josh,timestamp:12303]

If you do g.V(1).value(‘name’), we could look to the value indexed by the symbol #value, thus => “marko”.
If you do g.V(1).values(‘name’), you would get back a TSequence with a single TTuple being the meta property.
If you do g.V(1).values(‘name’).value(), we could get the value indexed by the symbol #value.
If you do g.V(1).values(‘name’).value(‘creator’), it will return the primitive string “josh”.

I believe that the following symbols should be recommended for use across all data structures.
	#id, #label, #key, #value
…where id(), label(), key(), value() are tuple.get(Symbol). Other symbols for use with propertygraph/ include:
	#outE, #inV, #inE, #outV, #bothE, #bothV

> In order to simplify user queries, you can let has() and values() do double
> duty, but I still feel that there are lower-level operations at play, at a
> logical level even if not at a bytecode level. However, expressing the a
> traversal in terms of its lowest-level relational operations may also be
> useful for query optimization.

One thing that I’m doing, that perhaps you haven’t caught onto yet, is that I’m not modeling everything in terms of “tables.” Each data structure is trying to stay as pure to its conceptual model as possible. Thus, there are no “joins” in property graphs as outE() references a TSequence<TEdge>, where TEdge is an interface that extends TTuple. You can just walk without doing any type of INNER JOIN. Now, if you model a property graph in a relational database, you will have to strategize the bytecode accordingly! Just a heads up in case you haven’t noticed that.

Thanks for your input,
Marko.

http://rredux.com <http://rredux.com/>



> 
> Josh
> 
> 
> 
> On Mon, Apr 29, 2019 at 7:34 AM Marko Rodriguez <okrammarko@gmail.com <ma...@gmail.com>>
> wrote:
> 
>> Hi,
>> 
>> *** This email is primarily for Josh (and Kuppitz). However, if others are
>> interested… ***
>> 
>> So I did a lot of thinking this weekend about structure/ and this morning,
>> I prototyped both graph/ and rdbms/.
>> 
>> This is the way I’m currently thinking of things:
>> 
>>        1. There are 4 base types in structure/.
>>                - Primitive: string, long, float, int, … (will constrain
>> these at some point).
>>                - TTuple<K,V>: key/value map.
>>                - TSequence<V>: an iterable of v objects.
>>                - TSymbol: like Ruby, I think we need “enum-like” symbols
>> (e.g., #id, #label).
>> 
>>        2. Every structure has a “root.”
>>                - for graph its TGraph implements TSequence<TVertex>
>>                - for rdbms its a TDatabase implements
>> TTuple<String,TTable>
>> 
>>        3. Roots implement Structure and thus, are what is generated by
>> StructureFactory.mint().
>>                - defined using withStructure().
>>                - For graph, its accessible via V().
>>                - For rdbms, its accessible via db().
>> 
>>        4. There is a list of core instructions for dealing with these
>> base objects.
>>                - value(K key): gets the TTuple value for the provided key.
>>                - values(K key): gets an iterator of the value for the
>> provided key.
>>                - entries(): gets an iterator of T2Tuple objects for the
>> incoming TTuple.
>>                - hasXXX(A,B): various has()-based filters for looking
>> into a TTuple and a TSequence
>>                - db()/V()/etc.: jump to the “root” of the withStructure()
>> structure.
>>                - drop()/add(): behave as one would expect and thus.
>> 
>> ————
>> 
>> For RDBMS, we have three interfaces in rdbms/.
>> (machine/machine-core/structure/rdbms)
>> 
>>        1. TDatabase implements TTuple<String,TTable> // the root
>> structure that indexes the tables.
>>        2. TTable implements TSequence<TRow<?>> // a table is a sequence
>> of rows
>>        3. TRow<V> implements TTuple<String,V>> // a row has string column
>> names
>> 
>> I then created a new project at machine/structure/jdbc). The classes in
>> here implement the above rdbms/ interfaces/
>> 
>> Here is an RDBMS session:
>> 
>> final Machine machine = LocalMachine.open();
>> final TraversalSource jdbc =
>>        Gremlin.traversal(machine).
>>                        withProcessor(PipesProcessor.class).
>>                        withStructure(JDBCStructure.class,
>> Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
>> 
>> System.out.println(jdbc.db().toList());
>> System.out.println(jdbc.db().entries().toList());
>> System.out.println(jdbc.db().value("people").toList());
>> System.out.println(jdbc.db().values("people").toList());
>> System.out.println(jdbc.db().values("people").value("name").toList());
>> System.out.println(jdbc.db().values("people").entries().toList());
>> 
>> This yields:
>> 
>> [<database#conn1: url=jdbc:h2:/tmp/test user=>]
>> [PEOPLE:<table#PEOPLE>]
>> [<table#people>]
>> [<row#PEOPLE:1>, <row#PEOPLE:2>]
>> [marko, josh]
>> [NAME:marko, AGE:29, NAME:josh, AGE:32]
>> 
>> The bytecode of the last query is:
>> 
>> [db(<database#conn1: url=jdbc:h2:/tmp/test user=>), values(people),
>> entries]
>> 
>> JDBCDatabase implements TDatabase, Structure.
>>        *** JDBCDatabase is the root structure and is referenced by db()
>> *** (CRUCIAL POINT)
>> 
>> Assume another table called ADDRESSES with two columns: name and city.
>> 
>> 
>> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).value(“city”)
>> 
>> The above is equivalent to:
>> 
>> SELECT city FROM people,addresses WHERE people.name=addresses.name
>> 
>> If you want to do an inner join (a product), you do this:
>> 
>> 
>> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).as(“y”).path(“x”,”y")
>> 
>> The above is equivalent to:
>> 
>> SELECT * FROM addresses INNER JOIN people ON people.name=addresses.name
>> 
>> NOTES:
>>        1. Instead of select(), we simply jump to the root via db() (or
>> V() for graph).
>>        2. Instead of project(), we simply use value() or values().
>>        3. Instead of select() being overloaded with by() join syntax, we
>> use has() and path().
>>                - like TP3 we will be smart about dropping path() data
>> once its no longer referenced.
>>        4. We can also do LEFT and RIGHT JOINs (haven’t thought through
>> FULL OUTER JOIN yet).
>>                - however, we don’t support ‘null' in TP so I don’t know
>> if we want to support these null-producing joins. ?
>> 
>> LEFT JOIN:
>>        * If an address doesn’t exist for the person, emit a “null”-filled
>> path.
>> 
>> jdbc.db().values(“people”).as(“x”).
>>  db().values(“addresses”).as(“y”).
>>    choose(has(“name”,eq(path(“x”).by(“name”))),
>>      identity(),
>>      path(“y”).by(null).as(“y”)).
>>  path(“x”,”y")
>> 
>> SELECT * FROM addresses LEFT JOIN people ON people.name=addresses.name
>> 
>> RIGHT JOIN:
>> 
>> jdbc.db().values(“people”).as(“x”).
>>  db().values(“addresses”).as(“y”).
>>    choose(has(“name”,eq(path(“x”).by(“name”))),
>>      identity(),
>>      path(“x”).by(null).as(“x”)).
>>  path(“x”,”y")
>> 
>> 
>> SUMMARY:
>> 
>> There are no “low level” instructions. Everything is based on the standard
>> instructions that we know and love. Finally, if not apparent, the above
>> bytecode chunks would ultimately get strategized into a single SQL query
>> (breadth-first) instead of one-off queries (depth-first) to improve
>> performance.
>> 
>> Neat?,
>> Marko.
>> 
>> http://rredux.com <http://rredux.com/> <http://rredux.com/ <http://rredux.com/>>


Re: The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Posted by Joshua Shinavier <jo...@fortytwo.net>.
Hi Marko,

I will respond in more detail tomorrow (I'm a late-night-thinking,
early-morning-writing kind of guy) but yes I think this is cool, so long as
we are not overloading the steps with different levels of abstraction.
A has("age",
29), for example, operates at a different level of abstraction than a
has("city",
"Santa Fe") if "city" is a column in an "addresses" table. At least, they
are different if the data model allows for multi-properties,
meta-properties, and hyper-edges. A property is something that can either
be there, attached to an element, or not be there. There may also be more
than one such property, and it may have other properties attached to it. A
column of a table, on the other hand, is always there (even if its value is
allowed to be null), always has a single value, and cannot have further
properties attached. The same goes for values().

In order to simplify user queries, you can let has() and values() do double
duty, but I still feel that there are lower-level operations at play, at a
logical level even if not at a bytecode level. However, expressing the a
traversal in terms of its lowest-level relational operations may also be
useful for query optimization.

Josh



On Mon, Apr 29, 2019 at 7:34 AM Marko Rodriguez <ok...@gmail.com>
wrote:

> Hi,
>
> *** This email is primarily for Josh (and Kuppitz). However, if others are
> interested… ***
>
> So I did a lot of thinking this weekend about structure/ and this morning,
> I prototyped both graph/ and rdbms/.
>
> This is the way I’m currently thinking of things:
>
>         1. There are 4 base types in structure/.
>                 - Primitive: string, long, float, int, … (will constrain
> these at some point).
>                 - TTuple<K,V>: key/value map.
>                 - TSequence<V>: an iterable of v objects.
>                 - TSymbol: like Ruby, I think we need “enum-like” symbols
> (e.g., #id, #label).
>
>         2. Every structure has a “root.”
>                 - for graph its TGraph implements TSequence<TVertex>
>                 - for rdbms its a TDatabase implements
> TTuple<String,TTable>
>
>         3. Roots implement Structure and thus, are what is generated by
> StructureFactory.mint().
>                 - defined using withStructure().
>                 - For graph, its accessible via V().
>                 - For rdbms, its accessible via db().
>
>         4. There is a list of core instructions for dealing with these
> base objects.
>                 - value(K key): gets the TTuple value for the provided key.
>                 - values(K key): gets an iterator of the value for the
> provided key.
>                 - entries(): gets an iterator of T2Tuple objects for the
> incoming TTuple.
>                 - hasXXX(A,B): various has()-based filters for looking
> into a TTuple and a TSequence
>                 - db()/V()/etc.: jump to the “root” of the withStructure()
> structure.
>                 - drop()/add(): behave as one would expect and thus.
>
> ————
>
> For RDBMS, we have three interfaces in rdbms/.
> (machine/machine-core/structure/rdbms)
>
>         1. TDatabase implements TTuple<String,TTable> // the root
> structure that indexes the tables.
>         2. TTable implements TSequence<TRow<?>> // a table is a sequence
> of rows
>         3. TRow<V> implements TTuple<String,V>> // a row has string column
> names
>
> I then created a new project at machine/structure/jdbc). The classes in
> here implement the above rdbms/ interfaces/
>
> Here is an RDBMS session:
>
> final Machine machine = LocalMachine.open();
> final TraversalSource jdbc =
>         Gremlin.traversal(machine).
>                         withProcessor(PipesProcessor.class).
>                         withStructure(JDBCStructure.class,
> Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
>
> System.out.println(jdbc.db().toList());
> System.out.println(jdbc.db().entries().toList());
> System.out.println(jdbc.db().value("people").toList());
> System.out.println(jdbc.db().values("people").toList());
> System.out.println(jdbc.db().values("people").value("name").toList());
> System.out.println(jdbc.db().values("people").entries().toList());
>
> This yields:
>
> [<database#conn1: url=jdbc:h2:/tmp/test user=>]
> [PEOPLE:<table#PEOPLE>]
> [<table#people>]
> [<row#PEOPLE:1>, <row#PEOPLE:2>]
> [marko, josh]
> [NAME:marko, AGE:29, NAME:josh, AGE:32]
>
> The bytecode of the last query is:
>
> [db(<database#conn1: url=jdbc:h2:/tmp/test user=>), values(people),
> entries]
>
> JDBCDatabase implements TDatabase, Structure.
>         *** JDBCDatabase is the root structure and is referenced by db()
> *** (CRUCIAL POINT)
>
> Assume another table called ADDRESSES with two columns: name and city.
>
>
> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).value(“city”)
>
> The above is equivalent to:
>
> SELECT city FROM people,addresses WHERE people.name=addresses.name
>
> If you want to do an inner join (a product), you do this:
>
>
> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).as(“y”).path(“x”,”y")
>
> The above is equivalent to:
>
> SELECT * FROM addresses INNER JOIN people ON people.name=addresses.name
>
> NOTES:
>         1. Instead of select(), we simply jump to the root via db() (or
> V() for graph).
>         2. Instead of project(), we simply use value() or values().
>         3. Instead of select() being overloaded with by() join syntax, we
> use has() and path().
>                 - like TP3 we will be smart about dropping path() data
> once its no longer referenced.
>         4. We can also do LEFT and RIGHT JOINs (haven’t thought through
> FULL OUTER JOIN yet).
>                 - however, we don’t support ‘null' in TP so I don’t know
> if we want to support these null-producing joins. ?
>
> LEFT JOIN:
>         * If an address doesn’t exist for the person, emit a “null”-filled
> path.
>
> jdbc.db().values(“people”).as(“x”).
>   db().values(“addresses”).as(“y”).
>     choose(has(“name”,eq(path(“x”).by(“name”))),
>       identity(),
>       path(“y”).by(null).as(“y”)).
>   path(“x”,”y")
>
> SELECT * FROM addresses LEFT JOIN people ON people.name=addresses.name
>
> RIGHT JOIN:
>
> jdbc.db().values(“people”).as(“x”).
>   db().values(“addresses”).as(“y”).
>     choose(has(“name”,eq(path(“x”).by(“name”))),
>       identity(),
>       path(“x”).by(null).as(“x”)).
>   path(“x”,”y")
>
>
> SUMMARY:
>
> There are no “low level” instructions. Everything is based on the standard
> instructions that we know and love. Finally, if not apparent, the above
> bytecode chunks would ultimately get strategized into a single SQL query
> (breadth-first) instead of one-off queries (depth-first) to improve
> performance.
>
> Neat?,
> Marko.
>
> http://rredux.com <http://rredux.com/>
>
>
>
>
>

Re: The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Posted by Marko Rodriguez <ok...@gmail.com>.
Hey,

Check this out:

############################
Machine machine = LocalMachine.open();
TraversalSource jdbc =
                Gremlin.traversal(machine).
                        withProcessor(PipesProcessor.class).
                        withStructure(JDBCStructure.class, Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
  
System.out.println(jdbc.db().values("people").as("x”).
                        db().values("addresses").as("y").has("name", __.path("x").by("name")).
                          path("x", "y").toList());
System.out.println(“\n\n”)
System.out.println(jdbc.db().values("people").as("x”).
                        db().values("addresses").as("y").has("name", __.path("x").by("name")).
                          path("x", "y").explain().toList());
############################

[[{NAME=marko, AGE=29}, {CITY=santa fe, NAME=marko}], [{NAME=josh, AGE=32}, {CITY=san jose, NAME=josh}]]


[Original                       	[db, values(people)@x, db, values(addresses)@y, hasKeyValue(name,[path(x,[value(name)])]), path(x,y,|)]
JDBCStrategy                   		[db(<database#conn9: url=jdbc:h2:/tmp/test user=>), values(people)@x, db(<database#conn10: url=jdbc:h2:/tmp/test user=>), values(addresses)@y, hasKeyValue(name,[path(x,[value(name)])]), path(x,y,|)]
JDBCQueryStrategy              		[jdbc:sql(conn9: url=jdbc:h2:/tmp/test user=,x,y,SELECT x.*, y.* FROM people AS x, addresses AS y WHERE x.name=y.name)]
PipesStrategy                  		[jdbc:sql(conn9: url=jdbc:h2:/tmp/test user=,x,y,SELECT x.*, y.* FROM people AS x, addresses AS y WHERE x.name=y.name)]
CoefficientStrategy            		[jdbc:sql(conn9: url=jdbc:h2:/tmp/test user=,x,y,SELECT x.*, y.* FROM people AS x, addresses AS y WHERE x.name=y.name)]
CoefficientVerificationStrategy		[jdbc:sql(conn9: url=jdbc:h2:/tmp/test user=,x,y,SELECT x.*, y.* FROM people AS x, addresses AS y WHERE x.name=y.name)]
-------------------------------
Compilation                    		[FlatMapInitial]
Execution Plan [PipesProcessor]		[InitialStep[FlatMapInitial]]]





I basically look for a db.values.db.values.has-pattern in the bytecode and if I find it, I try and roll it into a single provider-specific instruction that does a SELECT query.

Here is JDBCQueryStrategy (its ghetto and error prone, but I just wanted to get the basic concept working):
	https://github.com/apache/tinkerpop/blob/7142dc16d8fc81ad8bd4090096b42e5b9b1744f4/java/machine/structure/jdbc/src/main/java/org/apache/tinkerpop/machine/structure/jdbc/strategy/JDBCQueryStrategy.java <https://github.com/apache/tinkerpop/blob/7142dc16d8fc81ad8bd4090096b42e5b9b1744f4/java/machine/structure/jdbc/src/main/java/org/apache/tinkerpop/machine/structure/jdbc/strategy/JDBCQueryStrategy.java>
Here is SqlFlatMapStep (hyper-ghetto… but whateva’):
	https://github.com/apache/tinkerpop/blob/7142dc16d8fc81ad8bd4090096b42e5b9b1744f4/java/machine/structure/jdbc/src/main/java/org/apache/tinkerpop/machine/structure/jdbc/function/flatmap/SqlFlatMap.java <https://github.com/apache/tinkerpop/blob/7142dc16d8fc81ad8bd4090096b42e5b9b1744f4/java/machine/structure/jdbc/src/main/java/org/apache/tinkerpop/machine/structure/jdbc/function/flatmap/SqlFlatMap.java>

Na na!,
Marko.

http://rredux.com <http://rredux.com/>




> On Apr 29, 2019, at 11:50 AM, Marko Rodriguez <ok...@gmail.com> wrote:
> 
> Hello Kuppitz,
> 
>> I don't think it's a good idea to keep this mindset for TP4; NULLs are too
>> important in RDBMS. I don't know, maybe you can convince SQL people that
>> dropping a value is the same as setting its value to NULL. It would work
>> for you and me and everybody else who's familiar with Gremlin, but SQL
>> people really love their NULLs….
> 
> Hmm……. I don’t like nulls. Perhaps with time a clever solution will emerge. ????
> 
>> I'd prefer to just have special accessors for these. E.g. g.V().meta("id").
>> At least valueMaps would then only have String-keys.
>> I see the issue with that (naming collisions), but it's still better than
>> the enums in my opinion (which became a pain when started to implement
>> GLVs).
> 
> So, TSymbols are not Java enums. They are simply a “primitive”-type that will have a serialization like:
> 
> 	symbol[id]
> 
> Meaning, that people can make up Symbols all day long without having to update serializers. How I see them working is that they are Strings prefixed with #.
> 
> g.V().outE()             <=>   g.V().values(“#outE”)
> g.V().id()               <=>   g.V().value(“#id”)
> g.V().hasLabel(“person") <=>   g.V().has(“#label”,”person”)
> 
> Now that I type this out, perhaps we don’t even have a TSymbol-class. Instead, any String that starts with # is considered a symbol. Now watch this:
> 
> g.V().label()  <=>   g.V().value(“#label”)
> g.V().labels() <=>   g.V().values(“#label”)
> 
> In this way, we can support Neo4j multi-labels as a Neo4jVertex’s #label-Key references a TSequence<String>.
> 
> g.V(1).label() => TSequence<String>
> g.V(1).labels() => String, String, String, …
> g.V(1).label().add(“programmer”)
> g.V(1).label().drop(“person”)
> 
> So we could do “meta()”, but then you need respective “hasXXX”-meta() methods. I think #symbol is easiest .. ?
> 
>> Also, what I'm wondering about now: Have you thought about Stored
>> Procedures and Views in RDBMS? Views can be treated as tables, easy, but
>> what about stored procedures? SPs can be found in many more DBMS, would be
>> bad to not support them (or hack something ugly together later in the
>> development process).
> 
> I’m not super versed in RDBMS technology. Can you please explain to me how to create a StoreProcedure and the range of outputs a StoredProcedure produces? From there, I can try and “Bytecode-ize” it.
> 
> Thanks Kuppitz,
> Marko.
> 
> http://rredux.com <http://rredux.com/>
> 
> 
> 
> 
>> On Mon, Apr 29, 2019 at 7:34 AM Marko Rodriguez <okrammarko@gmail.com <ma...@gmail.com>>
>> wrote:
>> 
>>> Hi,
>>> 
>>> *** This email is primarily for Josh (and Kuppitz). However, if others are
>>> interested… ***
>>> 
>>> So I did a lot of thinking this weekend about structure/ and this morning,
>>> I prototyped both graph/ and rdbms/.
>>> 
>>> This is the way I’m currently thinking of things:
>>> 
>>>        1. There are 4 base types in structure/.
>>>                - Primitive: string, long, float, int, … (will constrain
>>> these at some point).
>>>                - TTuple<K,V>: key/value map.
>>>                - TSequence<V>: an iterable of v objects.
>>>                - TSymbol: like Ruby, I think we need “enum-like” symbols
>>> (e.g., #id, #label).
>>> 
>>>        2. Every structure has a “root.”
>>>                - for graph its TGraph implements TSequence<TVertex>
>>>                - for rdbms its a TDatabase implements
>>> TTuple<String,TTable>
>>> 
>>>        3. Roots implement Structure and thus, are what is generated by
>>> StructureFactory.mint().
>>>                - defined using withStructure().
>>>                - For graph, its accessible via V().
>>>                - For rdbms, its accessible via db().
>>> 
>>>        4. There is a list of core instructions for dealing with these
>>> base objects.
>>>                - value(K key): gets the TTuple value for the provided key.
>>>                - values(K key): gets an iterator of the value for the
>>> provided key.
>>>                - entries(): gets an iterator of T2Tuple objects for the
>>> incoming TTuple.
>>>                - hasXXX(A,B): various has()-based filters for looking
>>> into a TTuple and a TSequence
>>>                - db()/V()/etc.: jump to the “root” of the withStructure()
>>> structure.
>>>                - drop()/add(): behave as one would expect and thus.
>>> 
>>> ————
>>> 
>>> For RDBMS, we have three interfaces in rdbms/.
>>> (machine/machine-core/structure/rdbms)
>>> 
>>>        1. TDatabase implements TTuple<String,TTable> // the root
>>> structure that indexes the tables.
>>>        2. TTable implements TSequence<TRow<?>> // a table is a sequence
>>> of rows
>>>        3. TRow<V> implements TTuple<String,V>> // a row has string column
>>> names
>>> 
>>> I then created a new project at machine/structure/jdbc). The classes in
>>> here implement the above rdbms/ interfaces/
>>> 
>>> Here is an RDBMS session:
>>> 
>>> final Machine machine = LocalMachine.open();
>>> final TraversalSource jdbc =
>>>        Gremlin.traversal(machine).
>>>                        withProcessor(PipesProcessor.class).
>>>                        withStructure(JDBCStructure.class,
>>> Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
>>> 
>>> System.out.println(jdbc.db().toList());
>>> System.out.println(jdbc.db().entries().toList());
>>> System.out.println(jdbc.db().value("people").toList());
>>> System.out.println(jdbc.db().values("people").toList());
>>> System.out.println(jdbc.db().values("people").value("name").toList());
>>> System.out.println(jdbc.db().values("people").entries().toList());
>>> 
>>> This yields:
>>> 
>>> [<database#conn1: url=jdbc:h2:/tmp/test user=>]
>>> [PEOPLE:<table#PEOPLE>]
>>> [<table#people>]
>>> [<row#PEOPLE:1>, <row#PEOPLE:2>]
>>> [marko, josh]
>>> [NAME:marko, AGE:29, NAME:josh, AGE:32]
>>> 
>>> The bytecode of the last query is:
>>> 
>>> [db(<database#conn1: url=jdbc:h2:/tmp/test user=>), values(people),
>>> entries]
>>> 
>>> JDBCDatabase implements TDatabase, Structure.
>>>        *** JDBCDatabase is the root structure and is referenced by db()
>>> *** (CRUCIAL POINT)
>>> 
>>> Assume another table called ADDRESSES with two columns: name and city.
>>> 
>>> 
>>> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).value(“city”)
>>> 
>>> The above is equivalent to:
>>> 
>>> SELECT city FROM people,addresses WHERE people.name=addresses.name
>>> 
>>> If you want to do an inner join (a product), you do this:
>>> 
>>> 
>>> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).as(“y”).path(“x”,”y")
>>> 
>>> The above is equivalent to:
>>> 
>>> SELECT * FROM addresses INNER JOIN people ON people.name=addresses.name
>>> 
>>> NOTES:
>>>        1. Instead of select(), we simply jump to the root via db() (or
>>> V() for graph).
>>>        2. Instead of project(), we simply use value() or values().
>>>        3. Instead of select() being overloaded with by() join syntax, we
>>> use has() and path().
>>>                - like TP3 we will be smart about dropping path() data
>>> once its no longer referenced.
>>>        4. We can also do LEFT and RIGHT JOINs (haven’t thought through
>>> FULL OUTER JOIN yet).
>>>                - however, we don’t support ‘null' in TP so I don’t know
>>> if we want to support these null-producing joins. ?
>>> 
>>> LEFT JOIN:
>>>        * If an address doesn’t exist for the person, emit a “null”-filled
>>> path.
>>> 
>>> jdbc.db().values(“people”).as(“x”).
>>>  db().values(“addresses”).as(“y”).
>>>    choose(has(“name”,eq(path(“x”).by(“name”))),
>>>      identity(),
>>>      path(“y”).by(null).as(“y”)).
>>>  path(“x”,”y")
>>> 
>>> SELECT * FROM addresses LEFT JOIN people ON people.name=addresses.name
>>> 
>>> RIGHT JOIN:
>>> 
>>> jdbc.db().values(“people”).as(“x”).
>>>  db().values(“addresses”).as(“y”).
>>>    choose(has(“name”,eq(path(“x”).by(“name”))),
>>>      identity(),
>>>      path(“x”).by(null).as(“x”)).
>>>  path(“x”,”y")
>>> 
>>> 
>>> SUMMARY:
>>> 
>>> There are no “low level” instructions. Everything is based on the standard
>>> instructions that we know and love. Finally, if not apparent, the above
>>> bytecode chunks would ultimately get strategized into a single SQL query
>>> (breadth-first) instead of one-off queries (depth-first) to improve
>>> performance.
>>> 
>>> Neat?,
>>> Marko.
>>> 
>>> http://rredux.com <http://rredux.com/> <http://rredux.com/ <http://rredux.com/>>
> 


Re: The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Posted by Marko Rodriguez <ok...@gmail.com>.
Hello Kuppitz,

> I don't think it's a good idea to keep this mindset for TP4; NULLs are too
> important in RDBMS. I don't know, maybe you can convince SQL people that
> dropping a value is the same as setting its value to NULL. It would work
> for you and me and everybody else who's familiar with Gremlin, but SQL
> people really love their NULLs….

Hmm……. I don’t like nulls. Perhaps with time a clever solution will emerge. ????

> I'd prefer to just have special accessors for these. E.g. g.V().meta("id").
> At least valueMaps would then only have String-keys.
> I see the issue with that (naming collisions), but it's still better than
> the enums in my opinion (which became a pain when started to implement
> GLVs).

So, TSymbols are not Java enums. They are simply a “primitive”-type that will have a serialization like:

	symbol[id]

Meaning, that people can make up Symbols all day long without having to update serializers. How I see them working is that they are Strings prefixed with #.

g.V().outE()             <=>   g.V().values(“#outE”)
g.V().id()               <=>   g.V().value(“#id”)
g.V().hasLabel(“person") <=>   g.V().has(“#label”,”person”)

Now that I type this out, perhaps we don’t even have a TSymbol-class. Instead, any String that starts with # is considered a symbol. Now watch this:

g.V().label()  <=>   g.V().value(“#label”)
g.V().labels() <=>   g.V().values(“#label”)

In this way, we can support Neo4j multi-labels as a Neo4jVertex’s #label-Key references a TSequence<String>.

g.V(1).label() => TSequence<String>
g.V(1).labels() => String, String, String, …
g.V(1).label().add(“programmer”)
g.V(1).label().drop(“person”)

So we could do “meta()”, but then you need respective “hasXXX”-meta() methods. I think #symbol is easiest .. ?

> Also, what I'm wondering about now: Have you thought about Stored
> Procedures and Views in RDBMS? Views can be treated as tables, easy, but
> what about stored procedures? SPs can be found in many more DBMS, would be
> bad to not support them (or hack something ugly together later in the
> development process).

I’m not super versed in RDBMS technology. Can you please explain to me how to create a StoreProcedure and the range of outputs a StoredProcedure produces? From there, I can try and “Bytecode-ize” it.

Thanks Kuppitz,
Marko.

http://rredux.com <http://rredux.com/>




> On Mon, Apr 29, 2019 at 7:34 AM Marko Rodriguez <okrammarko@gmail.com <ma...@gmail.com>>
> wrote:
> 
>> Hi,
>> 
>> *** This email is primarily for Josh (and Kuppitz). However, if others are
>> interested… ***
>> 
>> So I did a lot of thinking this weekend about structure/ and this morning,
>> I prototyped both graph/ and rdbms/.
>> 
>> This is the way I’m currently thinking of things:
>> 
>>        1. There are 4 base types in structure/.
>>                - Primitive: string, long, float, int, … (will constrain
>> these at some point).
>>                - TTuple<K,V>: key/value map.
>>                - TSequence<V>: an iterable of v objects.
>>                - TSymbol: like Ruby, I think we need “enum-like” symbols
>> (e.g., #id, #label).
>> 
>>        2. Every structure has a “root.”
>>                - for graph its TGraph implements TSequence<TVertex>
>>                - for rdbms its a TDatabase implements
>> TTuple<String,TTable>
>> 
>>        3. Roots implement Structure and thus, are what is generated by
>> StructureFactory.mint().
>>                - defined using withStructure().
>>                - For graph, its accessible via V().
>>                - For rdbms, its accessible via db().
>> 
>>        4. There is a list of core instructions for dealing with these
>> base objects.
>>                - value(K key): gets the TTuple value for the provided key.
>>                - values(K key): gets an iterator of the value for the
>> provided key.
>>                - entries(): gets an iterator of T2Tuple objects for the
>> incoming TTuple.
>>                - hasXXX(A,B): various has()-based filters for looking
>> into a TTuple and a TSequence
>>                - db()/V()/etc.: jump to the “root” of the withStructure()
>> structure.
>>                - drop()/add(): behave as one would expect and thus.
>> 
>> ————
>> 
>> For RDBMS, we have three interfaces in rdbms/.
>> (machine/machine-core/structure/rdbms)
>> 
>>        1. TDatabase implements TTuple<String,TTable> // the root
>> structure that indexes the tables.
>>        2. TTable implements TSequence<TRow<?>> // a table is a sequence
>> of rows
>>        3. TRow<V> implements TTuple<String,V>> // a row has string column
>> names
>> 
>> I then created a new project at machine/structure/jdbc). The classes in
>> here implement the above rdbms/ interfaces/
>> 
>> Here is an RDBMS session:
>> 
>> final Machine machine = LocalMachine.open();
>> final TraversalSource jdbc =
>>        Gremlin.traversal(machine).
>>                        withProcessor(PipesProcessor.class).
>>                        withStructure(JDBCStructure.class,
>> Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
>> 
>> System.out.println(jdbc.db().toList());
>> System.out.println(jdbc.db().entries().toList());
>> System.out.println(jdbc.db().value("people").toList());
>> System.out.println(jdbc.db().values("people").toList());
>> System.out.println(jdbc.db().values("people").value("name").toList());
>> System.out.println(jdbc.db().values("people").entries().toList());
>> 
>> This yields:
>> 
>> [<database#conn1: url=jdbc:h2:/tmp/test user=>]
>> [PEOPLE:<table#PEOPLE>]
>> [<table#people>]
>> [<row#PEOPLE:1>, <row#PEOPLE:2>]
>> [marko, josh]
>> [NAME:marko, AGE:29, NAME:josh, AGE:32]
>> 
>> The bytecode of the last query is:
>> 
>> [db(<database#conn1: url=jdbc:h2:/tmp/test user=>), values(people),
>> entries]
>> 
>> JDBCDatabase implements TDatabase, Structure.
>>        *** JDBCDatabase is the root structure and is referenced by db()
>> *** (CRUCIAL POINT)
>> 
>> Assume another table called ADDRESSES with two columns: name and city.
>> 
>> 
>> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).value(“city”)
>> 
>> The above is equivalent to:
>> 
>> SELECT city FROM people,addresses WHERE people.name=addresses.name
>> 
>> If you want to do an inner join (a product), you do this:
>> 
>> 
>> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).as(“y”).path(“x”,”y")
>> 
>> The above is equivalent to:
>> 
>> SELECT * FROM addresses INNER JOIN people ON people.name=addresses.name
>> 
>> NOTES:
>>        1. Instead of select(), we simply jump to the root via db() (or
>> V() for graph).
>>        2. Instead of project(), we simply use value() or values().
>>        3. Instead of select() being overloaded with by() join syntax, we
>> use has() and path().
>>                - like TP3 we will be smart about dropping path() data
>> once its no longer referenced.
>>        4. We can also do LEFT and RIGHT JOINs (haven’t thought through
>> FULL OUTER JOIN yet).
>>                - however, we don’t support ‘null' in TP so I don’t know
>> if we want to support these null-producing joins. ?
>> 
>> LEFT JOIN:
>>        * If an address doesn’t exist for the person, emit a “null”-filled
>> path.
>> 
>> jdbc.db().values(“people”).as(“x”).
>>  db().values(“addresses”).as(“y”).
>>    choose(has(“name”,eq(path(“x”).by(“name”))),
>>      identity(),
>>      path(“y”).by(null).as(“y”)).
>>  path(“x”,”y")
>> 
>> SELECT * FROM addresses LEFT JOIN people ON people.name=addresses.name
>> 
>> RIGHT JOIN:
>> 
>> jdbc.db().values(“people”).as(“x”).
>>  db().values(“addresses”).as(“y”).
>>    choose(has(“name”,eq(path(“x”).by(“name”))),
>>      identity(),
>>      path(“x”).by(null).as(“x”)).
>>  path(“x”,”y")
>> 
>> 
>> SUMMARY:
>> 
>> There are no “low level” instructions. Everything is based on the standard
>> instructions that we know and love. Finally, if not apparent, the above
>> bytecode chunks would ultimately get strategized into a single SQL query
>> (breadth-first) instead of one-off queries (depth-first) to improve
>> performance.
>> 
>> Neat?,
>> Marko.
>> 
>> http://rredux.com <http://rredux.com/> <http://rredux.com/ <http://rredux.com/>>


Re: The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Posted by Daniel Kuppitz <me...@gremlin.guru>.
>
> we don’t support ‘null' in TP


I don't think it's a good idea to keep this mindset for TP4; NULLs are too
important in RDBMS. I don't know, maybe you can convince SQL people that
dropping a value is the same as setting its value to NULL. It would work
for you and me and everybody else who's familiar with Gremlin, but SQL
people really love their NULLs....

TSymbol: like Ruby, I think we need “enum-like” symbols (e.g., #id, #label).


I'd prefer to just have special accessors for these. E.g. g.V().meta("id").
At least valueMaps would then only have String-keys.
I see the issue with that (naming collisions), but it's still better than
the enums in my opinion (which became a pain when started to implement
GLVs).

Also, what I'm wondering about now: Have you thought about Stored
Procedures and Views in RDBMS? Views can be treated as tables, easy, but
what about stored procedures? SPs can be found in many more DBMS, would be
bad to not support them (or hack something ugly together later in the
development process).

Cheers,
Daniel


On Mon, Apr 29, 2019 at 7:34 AM Marko Rodriguez <ok...@gmail.com>
wrote:

> Hi,
>
> *** This email is primarily for Josh (and Kuppitz). However, if others are
> interested… ***
>
> So I did a lot of thinking this weekend about structure/ and this morning,
> I prototyped both graph/ and rdbms/.
>
> This is the way I’m currently thinking of things:
>
>         1. There are 4 base types in structure/.
>                 - Primitive: string, long, float, int, … (will constrain
> these at some point).
>                 - TTuple<K,V>: key/value map.
>                 - TSequence<V>: an iterable of v objects.
>                 - TSymbol: like Ruby, I think we need “enum-like” symbols
> (e.g., #id, #label).
>
>         2. Every structure has a “root.”
>                 - for graph its TGraph implements TSequence<TVertex>
>                 - for rdbms its a TDatabase implements
> TTuple<String,TTable>
>
>         3. Roots implement Structure and thus, are what is generated by
> StructureFactory.mint().
>                 - defined using withStructure().
>                 - For graph, its accessible via V().
>                 - For rdbms, its accessible via db().
>
>         4. There is a list of core instructions for dealing with these
> base objects.
>                 - value(K key): gets the TTuple value for the provided key.
>                 - values(K key): gets an iterator of the value for the
> provided key.
>                 - entries(): gets an iterator of T2Tuple objects for the
> incoming TTuple.
>                 - hasXXX(A,B): various has()-based filters for looking
> into a TTuple and a TSequence
>                 - db()/V()/etc.: jump to the “root” of the withStructure()
> structure.
>                 - drop()/add(): behave as one would expect and thus.
>
> ————
>
> For RDBMS, we have three interfaces in rdbms/.
> (machine/machine-core/structure/rdbms)
>
>         1. TDatabase implements TTuple<String,TTable> // the root
> structure that indexes the tables.
>         2. TTable implements TSequence<TRow<?>> // a table is a sequence
> of rows
>         3. TRow<V> implements TTuple<String,V>> // a row has string column
> names
>
> I then created a new project at machine/structure/jdbc). The classes in
> here implement the above rdbms/ interfaces/
>
> Here is an RDBMS session:
>
> final Machine machine = LocalMachine.open();
> final TraversalSource jdbc =
>         Gremlin.traversal(machine).
>                         withProcessor(PipesProcessor.class).
>                         withStructure(JDBCStructure.class,
> Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
>
> System.out.println(jdbc.db().toList());
> System.out.println(jdbc.db().entries().toList());
> System.out.println(jdbc.db().value("people").toList());
> System.out.println(jdbc.db().values("people").toList());
> System.out.println(jdbc.db().values("people").value("name").toList());
> System.out.println(jdbc.db().values("people").entries().toList());
>
> This yields:
>
> [<database#conn1: url=jdbc:h2:/tmp/test user=>]
> [PEOPLE:<table#PEOPLE>]
> [<table#people>]
> [<row#PEOPLE:1>, <row#PEOPLE:2>]
> [marko, josh]
> [NAME:marko, AGE:29, NAME:josh, AGE:32]
>
> The bytecode of the last query is:
>
> [db(<database#conn1: url=jdbc:h2:/tmp/test user=>), values(people),
> entries]
>
> JDBCDatabase implements TDatabase, Structure.
>         *** JDBCDatabase is the root structure and is referenced by db()
> *** (CRUCIAL POINT)
>
> Assume another table called ADDRESSES with two columns: name and city.
>
>
> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).value(“city”)
>
> The above is equivalent to:
>
> SELECT city FROM people,addresses WHERE people.name=addresses.name
>
> If you want to do an inner join (a product), you do this:
>
>
> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).as(“y”).path(“x”,”y")
>
> The above is equivalent to:
>
> SELECT * FROM addresses INNER JOIN people ON people.name=addresses.name
>
> NOTES:
>         1. Instead of select(), we simply jump to the root via db() (or
> V() for graph).
>         2. Instead of project(), we simply use value() or values().
>         3. Instead of select() being overloaded with by() join syntax, we
> use has() and path().
>                 - like TP3 we will be smart about dropping path() data
> once its no longer referenced.
>         4. We can also do LEFT and RIGHT JOINs (haven’t thought through
> FULL OUTER JOIN yet).
>                 - however, we don’t support ‘null' in TP so I don’t know
> if we want to support these null-producing joins. ?
>
> LEFT JOIN:
>         * If an address doesn’t exist for the person, emit a “null”-filled
> path.
>
> jdbc.db().values(“people”).as(“x”).
>   db().values(“addresses”).as(“y”).
>     choose(has(“name”,eq(path(“x”).by(“name”))),
>       identity(),
>       path(“y”).by(null).as(“y”)).
>   path(“x”,”y")
>
> SELECT * FROM addresses LEFT JOIN people ON people.name=addresses.name
>
> RIGHT JOIN:
>
> jdbc.db().values(“people”).as(“x”).
>   db().values(“addresses”).as(“y”).
>     choose(has(“name”,eq(path(“x”).by(“name”))),
>       identity(),
>       path(“x”).by(null).as(“x”)).
>   path(“x”,”y")
>
>
> SUMMARY:
>
> There are no “low level” instructions. Everything is based on the standard
> instructions that we know and love. Finally, if not apparent, the above
> bytecode chunks would ultimately get strategized into a single SQL query
> (breadth-first) instead of one-off queries (depth-first) to improve
> performance.
>
> Neat?,
> Marko.
>
> http://rredux.com <http://rredux.com/>
>
>
>
>
>