Posted to dev@nifi.apache.org by Mike Thomsen <mi...@gmail.com> on 2018/05/12 11:01:43 UTC

Graph database support w/ NiFi

I was wondering if anyone on the dev list had given much thought to graph
database support in NiFi. There are a lot of graph databases out there, and
many of them seem to be half-baked or barely supported. Narrowing it down,
it looks like the best candidates for a no-fuss, decent-sized graph that we
could build up with NiFi processors would be OrientDB, Neo4J and ArangoDB.
The first two are particularly attractive because they offer JDBC drivers,
which opens up the possibility of even making them part of the standard
JDBC-based processors.

Anyone have any opinions or insights on this issue? I might have to do
OrientDB anyway, but if someone has a good feel for the market and can make
recommendations that would be appreciated.

Thanks,

Mike

Re: Graph database support w/ NiFi

Posted by "Uwe@Moosheimer.com" <Uw...@Moosheimer.com>.

joe

Wouldn't it be good to integrate Apache Atlas more closely with NiFi?

What I mean is using something that already exists before doing it in a new way.

Best regards
Kay-Uwe Moosheimer

> On 12.05.2018 at 13:07, Joe Witt <jo...@gmail.com> wrote:
> 
> mike
> 
> Do you mean support to send data to a graphdb?
> 
> A really awesome case would be sending provenance data to one and building
> queries, etc... around it!
> 
> I know mattyb would be all over that.
> 
> Thanks
> 

Re: Graph database support w/ NiFi

Posted by Joe Witt <jo...@gmail.com>.

mike

Do you mean support to send data to a graphdb?

A really awesome case would be sending provenance data to one and building
queries, etc... around it!

I know mattyb would be all over that.

Thanks

Re: Graph database support w/ NiFi

Posted by Otto Fowler <ot...@gmail.com>.

+1 for the wiki page

On May 12, 2018 at 10:52:43, Matt Burgess (mattyb149@apache.org) wrote:

All,

As Joe implied, I'm very happy that we are discussing graph tech in
relation to NiFi! NiFi and Graph theory/tech/analytics are passions of
mine. Mike, the examples you list are great, I would add Titan (and
its fork Janusgraph as Kay-Uwe mentioned) and Azure CosmosDB (these
and others are at [1]). I think there are at least four aspects to
this:

1) Graph query/traversal: This deals with getting data out of a graph
database and into flow file(s) for further processing. Here I agree
with Kay-Uwe that we should consider Apache Tinkerpop as the main
library for graph query/traversal, for a few reasons. The first, as
Kay-Uwe said, is that there are many adapters for Tinkerpop (TP) to
connect to various databases; from Mike's list I believe ArangoDB is
the only one that does not yet have a TP adapter. The second follows
from the first: TP is a standard interface and graph traversal
engine with a common DSL in Gremlin. The third is that Gremlin is a
Groovy-based DSL, Groovy syntax is fairly close to Java 8+ syntax,
and you can call Groovy/Gremlin from Java and vice versa. The fourth is
that Tinkerpop is an Apache TLP with a very active and vibrant
community, so we will be able to reap the benefits of all the graph
goodness they develop moving forward. I think a QueryGraph processor
could be appropriate, perhaps with a GraphDBConnectionPool controller
service or something of the like. Apache DBCP can't do the pooling for
us, but we could implement something similar to that for pooling TP
connections.

2) Graph ingest: This one IMO is the long pole in the tent. Gremlin is
a graph traversal language, and although its API has addVertex() and
addEdge() methods and such, it seems like an inefficient solution,
akin to using individual INSERTs in an RDBMS rather than a
PreparedStatement or a bulk load. Keeping the analogy, bulk loading in
RDBMSs is usually specific to that DB, and the same goes for graphs.
The Titan-based ones have Titan-Hadoop (formerly Faunus), Neo4j has
external tools (not sure if there's a Java API or not) and Cypher,
OrientDB has an ETL pipeline system, etc. If we have a standard Graph
concept, we could have controller services / writers that are
system-specific (see aspect #4).

3) Arbitrary data -> Graph: Converting non-graph data into a graph
almost always takes domain knowledge, which NiFi itself won't have and
will thus have to be provided by the user. We'd need to make it as
simple as possible but also as powerful and flexible as possible in
order to get the most value. We can investigate how each of the
systems in aspect #2 approaches this, and perhaps come up with a good
user experience around it.

4) Organization and implementation: I think we should make sure to
keep the capabilities very loosely coupled in terms of which
modules/NARs/JARs provide which capabilities, to allow for maximum
flexibility and ease of future development. I would prefer an
API/libraries module akin to nifi-hadoop-libraries-nar, which would
only include Apache Tinkerpop and any dependencies needed to do "pure"
graph stuff, so probably no TP adapters except tinkergraph (and/or its
faster fork from ShiftLeft [2]). The reason I say that is so NiFi
components (and even the framework!) could use graphs in a lightweight
manner, without lots of heavy and possibly unnecessary dependencies.
Imagine being able to query your own flows using Gremlin or Cypher! I
also envision an API much like the Record API in NiFi but for graphs,
so we'd have GraphReaders and GraphWriters perhaps, they could convert
from GraphML to GraphSON or Kryo for example, or in conjunction with a
ConvertRecordToGraph processor, could be used to support the
capability in aspect #3 above. I'd also be looking at bringing in
Gremlin to the scripting processors, or having a Gremlin based
scripting bundle as NiFi's graph capabilities mature.

You might be able to tell I'm excited about this discussion ;) Should
we get a Wiki page going for ideas, and/or keep it going here, or
something else? I'm all ears for thoughts, questions, and ideas
(especially the ones that might seem crazy!)

Regards,
Matt

[1] http://tinkerpop.apache.org/providers.html
[2] https://github.com/ShiftLeftSecurity/tinkergraph-gremlin
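
A minimal sketch of the GraphReader/GraphWriter idea from aspect #4, assuming a bare property graph model that is independent of any particular graph database. The class names (Vertex, Edge, PropertyGraph, JsonGraphWriter) and the JSON layout are hypothetical illustrations; this is neither NiFi's API nor GraphSON.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Vertex:
    id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    out_v: str  # id of the source vertex
    in_v: str   # id of the target vertex
    properties: dict = field(default_factory=dict)


@dataclass
class PropertyGraph:
    vertices: dict = field(default_factory=dict)  # id -> Vertex
    edges: list = field(default_factory=list)

    def add_vertex(self, vertex):
        self.vertices[vertex.id] = vertex

    def add_edge(self, edge):
        # An edge is only valid if both endpoints are already known.
        if edge.out_v not in self.vertices or edge.in_v not in self.vertices:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(edge)


class JsonGraphWriter:
    """Hypothetical writer: serializes a graph to a simple JSON layout."""

    def write(self, graph):
        return json.dumps(
            {
                "vertices": [asdict(v) for v in graph.vertices.values()],
                "edges": [asdict(e) for e in graph.edges],
            },
            sort_keys=True,
        )
```

A system-specific writer (bulk loader, Gremlin, Cypher) could then be swapped in behind the same write() call without touching the model classes.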

On Sat, May 12, 2018 at 8:02 AM, Uwe@Moosheimer.com <Uw...@moosheimer.com> wrote:
> Hi Mike,
>
> graph database support is not quite as easy as it seems.
> Unlike relational databases, graphs not only have defined vertices and
> edges (labeled vertices and edges); they can be directed or undirected and
> might have attributes on the nodes and edges, too.
>
> This makes it a bit confusing for a general interface.
>
> In general, a graph database should always be accessed via TinkerPop 3
> (or higher), since every professional graph database supports TinkerPop.
> TinkerPop is for graph databases what JDBC is for relational databases.
>
> I tried to create a general NiFi processor for graph databases myself and
> then quit.
> Unlike relational databases, graph databases usually have many
> dependencies.
>
> You do not simply create a data set but search for a particular vertex
> (which may still have certain edges) and create further edges and vertices
> at that vertex.
> And the search for the correct node is usually context-related.
>
> This makes it difficult to do something general for all requirements.
>
> In any case I am looking forward to your concept and how you want to
> solve it.
> It's definitely a good idea but hard to solve.
>
> Btw.: You forgot the most important graph database - Janusgraph.
>
> Best regards
> Kay-Uwe Moosheimer
>

Re: Graph database support w/ NiFi

Posted by "Uwe@Moosheimer.com" <Uw...@Moosheimer.com>.

Mike, Matt,

I agree with Matt.

As I understood it, Mike is talking about trying Janusgraph. That doesn't
mean that "only" Janusgraph would be supported, does it? Do I see that right, Mike?
We should build a graph processor on Tinkerpop 3.x to support different
graph databases.

In my opinion, the standard API concept doesn't quite work for Graph
databases anymore.

Not only do we have dependencies. We also need to know how to handle it
when connecting an edge with two nodes. Do both nodes already exist? Or
only one of them? Should the non-existing node(s) be created or not?

What about attributes at nodes and edges? And what about
meta-information attached to attributes?

There is also the question of whether we only do inserts. What about updates
and deletes?

I think we need something like this:
{
  "relation": {
    "from": {
      "type": "PersonRecord",
      "values": [
        {"name": "name", "value": "MyName", "type": "string", "meta": ... },
        {"name": "dob", "value": "1514819125", "type": "date", "meta": ... },
        ...
      ]
    },
    "to": {
      "type": "PersonRecord",
      "values": [
        {"name": "name", "value": "NextName", "type": "string", "meta": ... },
        {"name": "dob", "value": "1514819125", "type": "date", "meta": ... },
        ...
      ]
    },
    "how": {
      "direction": "out",
      "edgeLabel": "emailed",
      "values": [
        {"name": "created", "value": "1234567890", "type": "date", "meta": ... },
        ...
      ]
    }
  }
}

But also:

{
  "set": {
    "vertex": {
      "type": "PersonRecord",
      "values": [
        {"name": "name", "value": "MyName", "type": "string", "meta": ... },
        {"name": "dob", "value": "1514819125", "type": "date", "meta": ... },
        ...
      ]
    }
  }
}


{
  "delete": {
    "vertex": {
      "type": "PersonRecord",
      "values": [
        {"name": "name", "value": "MyName"},
        {"name": "dob", "value": "1514819125"},
        ...
      ]
    }
  }
}
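
To make the "do both nodes already exist?" question concrete, here is a rough sketch of applying a "relation" document to a toy in-memory graph. Everything in it is an assumption for illustration: the InMemoryGraph class, the create_missing switch and the rule of matching a vertex on all listed values are made up here and are not an existing NiFi or TinkerPop API.

```python
def props_of(part):
    """Flatten a list of {"name": ..., "value": ...} entries into a dict."""
    return {entry["name"]: entry["value"] for entry in part.get("values", [])}


class InMemoryGraph:
    def __init__(self):
        self.vertices = []  # each: {"label": str, "props": dict}
        self.edges = []     # each: {"label": str, "out": int, "in": int, "props": dict}

    def find(self, label, props):
        """Return the indexes of all vertices matching the label and every property."""
        return [
            i for i, v in enumerate(self.vertices)
            if v["label"] == label
            and all(v["props"].get(k) == val for k, val in props.items())
        ]

    def get_or_create(self, label, props, create_missing):
        matches = self.find(label, props)
        if len(matches) > 1:
            raise LookupError(f"ambiguous match for {label} {props}")
        if matches:
            return matches[0]
        if not create_missing:
            raise LookupError(f"no vertex for {label} {props}")
        self.vertices.append({"label": label, "props": dict(props)})
        return len(self.vertices) - 1


def apply_relation(graph, relation, create_missing=True):
    """Resolve both endpoints, then add the edge described by "how"."""
    src = graph.get_or_create(
        relation["from"]["type"], props_of(relation["from"]), create_missing)
    dst = graph.get_or_create(
        relation["to"]["type"], props_of(relation["to"]), create_missing)
    how = relation["how"]
    if how["direction"] == "in":
        src, dst = dst, src  # an incoming edge points at the "from" vertex
    graph.edges.append(
        {"label": how["edgeLabel"], "out": src, "in": dst, "props": props_of(how)})
    return src, dst
```

With create_missing=False the same call fails instead of silently creating vertices, which is one possible answer to the question of what should happen when only one endpoint exists.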

How do we deal with it if the search for the vertex in an update or
delete returns more than one result?
There is also the question of how we tell the API that we want to delete a
node together with all its attached edges and nodes, as long as they meet a
certain condition (e.g. delete all the attached nodes if the edge is of type
XYZ), recursively downwards or upwards.
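
The "delete downwards along edges of type XYZ" case could look roughly like this on a toy graph. The representation (a set of vertex ids plus (out, label, in) tuples) and the function name are assumptions for the sake of the example; a real implementation would run as a traversal inside the graph database.

```python
from collections import deque


def cascade_delete(vertices, edges, start, edge_label):
    """Delete `start` and everything reachable from it via outgoing edges
    carrying `edge_label`; returns the remaining vertices and edges."""
    doomed = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for out_v, label, in_v in edges:
            # Follow only matching outgoing edges; the membership check on
            # `doomed` also protects against cycles.
            if out_v == current and label == edge_label and in_v not in doomed:
                doomed.add(in_v)
                queue.append(in_v)
    remaining_vertices = vertices - doomed
    # Any edge touching a deleted vertex goes away, whatever its label.
    remaining_edges = [
        e for e in edges if e[0] not in doomed and e[2] not in doomed
    ]
    return remaining_vertices, remaining_edges
```

Deleting "upwards" would be the same loop with the roles of out_v and in_v swapped.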

This is the special thing about graph databases. In an RDBMS this can be
done by the database (on delete cascade or by stored procedure). In a
NoSQL database we usually don't have the problem either.

Doing this via the registry is a very good idea. And in general, accessing
the graph database only via the Gremlin Server API is, in my opinion, the
only way to generalize this.

It might also be worth considering making an API that supports Gremlin
(and Cypher). Then Gremlin will be sent to the Gremlin server.
E.g. g.V().hasLabel(${"vertex.type"}).has("${"vertex.values.name"}...drop()
etc.
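
As a sketch of how such a query could be generated from the "delete" document above: the helper below builds a Gremlin script plus a map of bindings, on the assumption that the script is submitted to Gremlin Server as a parameterized script, so values are never spliced into the query text. The function name and the binding names (vertexLabel, k0, v0, ...) are made up for this example.

```python
def build_drop_request(delete_doc):
    """Turn a {"delete": {"vertex": {...}}} document into (script, bindings)."""
    vertex = delete_doc["delete"]["vertex"]
    bindings = {"vertexLabel": vertex["type"]}
    steps = ["g.V()", "hasLabel(vertexLabel)"]
    for i, entry in enumerate(vertex["values"]):
        # One has(key, value) step per listed property, both passed as bindings.
        bindings[f"k{i}"] = entry["name"]
        bindings[f"v{i}"] = entry["value"]
        steps.append(f"has(k{i}, v{i})")
    steps.append("drop()")
    return ".".join(steps), bindings
```

For a document with the two values "name" and "dob", the generated script is g.V().hasLabel(vertexLabel).has(k0, v0).has(k1, v1).drop(), and the bindings carry the label and the property names and values.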

Regards,
Uwe

On 15.10.2018 at 23:18, Matt Burgess wrote:
> There are a few graph data formats that are supported by Gremlin, such as GraphSON, GraphML, Kryo, etc. I don’t think the format used by D3.js is supported, but we could support any/all of them in NiFi.
>
> I’m picturing a graph version of the Record API, either “pure” or subclassed from the standard API. The standard API has a single “record” concept central to its model, whereas a property graph model of course has two entities, nodes and edges, with edges relating nodes to each other; our Record API doesn’t inherently have relationships between records.
>
> This Graph API should focus squarely on the property graph model and not any particular GraphDB tech. Having said that, we can use Apache Tinkerpop core in the Graph API without touching Gremlin, just for the property graph model stuff. The individual processors would handle moving/fetching data from whichever DBs using whichever language/library. That way the Cypher processors and Gremlin processors could use the same Graph Reader/Writer API.
>
> The bigger challenge IMO is how to write processors to convert record-based data to the graph model. It sometimes seems simple, but only for well-aligned and well-prepared data. Take provenance, for example: the lineage is based on time (if you sort the nodes) rather than on an explicit relationship. But that can be for another discussion :)
>
> Regards,
> Matt
>
>> On Oct 15, 2018, at 4:38 PM, Mike Thomsen <mi...@gmail.com> wrote:
>>
>> Uwe,
>>
>> I had a chance to get into JanusGraph w/ Gremlin Server today. Any thoughts
>> on how you would integrate that? I have some inchoate thoughts about how to
>> build some sort of Avro-based reader setup so you can do strongly typed
>> associations sorta like this:
>>
>> {
>>  "from": {
>>    "type": "PersonRecord",
>>    "value": { ....}
>>  },
>>  "to": {
>>    "type": "PersonRecord",
>>    "value": { ....}
>>  },
>>  "direction": "out",
>>  "edgeLabel": "emailed"
>> }
>>
>> We could mix that with the schema registry APIs to generate Gremlin syntax
>> to send to the Gremlin server.
>>
>> First time I've done this, so please (Matt too) let me know what you think.
>>
>> Thanks,
>>
>> Mike
>>
>>> On Sun, Oct 14, 2018 at 6:07 AM Mike Thomsen <mi...@gmail.com> wrote:
>>>
>>> We have a Neo4J processor in a PR, but it is very much tied to Neo4J and
>>> Cypher. I was raising the issue that we might want to take that PR and
>>> extend it into an "ExecuteCypherQuery" processor with controller services
>>> that use either cypher for gremlin or the neo4j driver.
>>>
>>> On Sun, Oct 14, 2018 at 6:03 AM Uwe@Moosheimer.com <Uw...@moosheimer.com>
>>> wrote:
>>>
>>>> Mike,
>>>>
>>>> Cypher for Gremlin is a good idea. We can start with it and then later
>>>> allow an alternative so that users can use either Cypher or Gremlin
>>>> directly.
>>>>
>>>> Focusing on Neo4J or Janusgraph or xyz is, in my opinion, not
>>>> the right approach.
>>>> We should have a NiFi Graph processor that supports Tinkerpop. Via the
>>>> Gremlin server we can support all Tinkerpop capable graph databases
>>>> (
>>>> https://github.com/apache/tinkerpop/blob/master/gremlin-server/conf/gremlin-server-neo4j.yaml
>>>> ).
>>>>
>>>> Via a controller service we can then connect either Neo4J or Janusgraph
>>>> or any other graph DB.
>>>> Otherwise we would have to build a processor for each Graph DB.
>>>> We don't do that in NiFi for RDBMS either. There we have an ExecuteSQL
>>>> or PutSQL and specify via the controller service what we want to connect to.
>>>>
>>>> What do you think, Mike?
>>>>
>>>> Best Regards,
>>>> Uwe
>>>>
>>>>> On 06.10.2018 at 00:15, Mike Thomsen wrote:
>>>>> Uwe and Matt,
>>>>>
>>>>> Now that we're dipping our toes into Neo4J and Cypher, any thoughts on
>>>>> this?
>>>>> https://github.com/opencypher/cypher-for-gremlin
>>>>>
>>>>> I'm wondering if we shouldn't work with mans2singh to take the Neo4J
>>>>> work and push it further into having a client API that can let us inject a
>>>>> service that uses that or one that uses Neo4J's drivers.
>>>>>
>>>>> Mike
>>>>>
>>>>> On Mon, May 14, 2018 at 7:13 AM Otto Fowler <ot...@gmail.com> wrote:
>>>>>> The wiki discussion should list these and other points of concern and
>>>>>> should document the extent to which
>>>>>> they are to be addressed.
>>>>>>
>>>>>>
>>>>>> On May 12, 2018 at 12:37:59, Uwe@Moosheimer.com (uwe@moosheimer.com)
>>>>>> wrote:
>>>>>>
>>>>>> Matt,
>>>>>>
>>>>>> You have some interesting ideas that I really like.
>>>>>> GraphReaders and GraphWriters would be interesting. When I started
>>>>>> writing a graph processor with my idea, the concept was not yet
>>>>>> implemented in NiFi.
>>>>>> I don't find GraphML and GraphSON so exciting because they contain e.g.
>>>>>> the Vertex/Edge IDs and, to my knowledge, serve as import and export
>>>>>> formats (correct me if I'm wrong).
>>>>>>
>>>>>> A ConvertRecordToGraph processor is a good approach, the only question
>>>>>> is from which format we can convert?
>>>>>>
>>>>>> I also think that, to make a graph processor somewhat general, we would
>>>>>> have to provide a query as input that finds the correct vertex from which
>>>>>> the graph should be extended.
>>>>>> Maybe, as you suggested, with a Gremlin query or a small Gremlin script.
>>>>>> If a vertex is found, a new edge and a new vertex are added.
>>>>>> The question is how we transmit the individual attributes to the edge and
>>>>>> vertex, as well as the labels of the edge and vertex. Possibly with NiFi
>>>>>> attributes?
>>>>>>
>>>>>> I have some headaches about the complexity.
>>>>>> A small example:
>>>>>> Imagine we have a set from a CSV file.
>>>>>> The columns are Set ID, Token1, Token2, Token3...
>>>>>> ID, Token1,Token2,Token3,Token4,Token5
>>>>>> 123, Mary, had, a, little, lamb
>>>>>>
>>>>>> I want to create a vertex with ID 123 (if not exists). Then I want to
>>>>>> check for each token if a vertex exists in the graph database (search
>>>>>> for vertex with label "Token" and attribute "name"="Mary"). If the
>>>>>> vertex does not exist, the vertex has to be created.
>>>>>> Since I want to save e.g. Wikipedia to my graph I want to avoid the
>>>>>> supernode problem for the token vertices. I create a few distribution
>>>>>> vertices for each vertex that belongs to a token. If there is a vertex
>>>>>> for Token1(Mary) then I don't want to make the edge from this vertex to
>>>>>> my vertex with the ID 123, but from one of the distribution vertices.
>>>>>> If the vertex for the token does not exist, the distribution vertices
>>>>>> have also to be created ... and so on...
>>>>>>
>>>>>> Even with this very simple example it seems to become difficult with a
>>>>>> universal processor.
>>>>>>
>>>>>> In any case I think the idea to implement a graph processor in NiFi is
>>>>>> a good one.
>>>>>> The more we work on it the more good ideas we get, and maybe it is only
>>>>>> me who can't see the forest for the trees.
>>>>>>
>>>>>> One question about Titan. To my knowledge, Titan has been dead for a
>>>>>> year and a half and Janusgraph is the successor?
>>>>>> Titan has unofficially become Datastax Enterprise Graph?!
>>>>>> Supporting Titan could become difficult because, to my knowledge, Titan
>>>>>> does not support TinkerPop 3 and is no longer maintained.
>>>>>>
>>>>>> I like your idea of a wiki page for more ideas. Otherwise one gets lost
>>>>>> in the many mails.
>>>>>>
>>>>>> Regards,
>>>>>> Kay-Uwe
>>>>>>

Re: Graph database support w/ NiFi

Posted by Matt Burgess <ma...@gmail.com>.

There are a few graph data formats that are supported by Gremlin, such as GraphSON, GraphML, Kryo, etc. I don’t think the format used by D3.js is supported, but we could support any/all of them in NiFi.

I’m picturing a graph version of the Record API, either “pure” or subclassed from the standard API. The standard API has a single “record” concept central to its model, whereas a property graph model of course has two entities, nodes and edges, with edges relating nodes to each other; our Record API doesn’t inherently have relationships between records.

This Graph API should focus squarely on the property graph model and not any particular GraphDB tech. Having said that, we can use Apache Tinkerpop core in the Graph API without touching Gremlin, just for the property graph model stuff. The individual processors would handle moving/fetching data from whichever DBs using whichever language/library. That way the Cypher processors and Gremlin processors could use the same Graph Reader/Writer API.

The bigger challenge IMO is how to write processors to convert record-based data to the graph model. It sometimes seems simple, but only for well-aligned and well-prepared data. Take provenance, for example: the lineage is based on time (if you sort the nodes) rather than on an explicit relationship. But that can be for another discussion :)

Regards,
Matt
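
To illustrate the provenance remark, a small sketch of deriving edges from time order rather than from an explicit relationship. The event fields ("id", "flowfile", "ts") and the "NEXT" edge label are assumptions made for this example and do not reflect NiFi's actual provenance schema.

```python
from itertools import groupby
from operator import itemgetter


def lineage_edges(events):
    """Link consecutive events of the same flow file, ordered by timestamp."""
    ordered = sorted(events, key=itemgetter("flowfile", "ts"))
    edges = []
    for _, group in groupby(ordered, key=itemgetter("flowfile")):
        chain = list(group)
        # Each event points at the next later event of the same flow file.
        for earlier, later in zip(chain, chain[1:]):
            edges.append((earlier["id"], "NEXT", later["id"]))
    return edges
```

The relationship only appears after sorting, which is the domain knowledge a generic record-to-graph conversion would somehow have to be told about.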

> On Oct 15, 2018, at 4:38 PM, Mike Thomsen <mi...@gmail.com> wrote:
> 
> Uwe,
> 
> I had a chance to get into JanusGraph w/ Gremlin Server today. Any thoughts
> on how you would integrate that? I have some inchoate thoughts about how to
> build some sort of Avro-based reader setup so you can do strongly typed
> associations sorta like this:
> 
> {
>  "from": {
>    "type": "PersonRecord",
>    "value": { ....}
>  },
>  "to": {
>    "type": "PersonRecord",
>    "value": { ....}
>  },
>  "direction": "out",
>  "edgeLabel": "emailed"
> }
> 
> We could mix that with the schema registry APIs to generate Gremlin syntax
> to send to the Gremlin server.
> 
> First time I've done this, so please (Matt too) let me know what you think.
> 
> Thanks,
> 
> Mike
> 
>> On Sun, Oct 14, 2018 at 6:07 AM Mike Thomsen <mi...@gmail.com> wrote:
>> 
>> We have a Neo4J processor in a PR, but it is very much tied to Neo4J and
>> Cypher. I was raising the issue that we might want to take that PR and
>> extend it into an "ExecuteCypherQuery" processor with controller services
>> that use either cypher for gremlin or the neo4j driver.
>> 
>> On Sun, Oct 14, 2018 at 6:03 AM Uwe@Moosheimer.com <Uw...@moosheimer.com>
>> wrote:
>> 
>>> Mike,
>>> 
>>> Cypher for Gremlin is a good idea. We can start with it and then later
>>> allow an alternative so that users can use either Cypher or Gremlin
>>> directly.
>>> 
>>> Focusing on Neo4J or Janusgraph or xyz is, in my opinion, not
>>> the right approach.
>>> We should have a NiFi Graph processor that supports Tinkerpop. Via the
>>> Gremlin server we can support all Tinkerpop capable graph databases
>>> (
>>> https://github.com/apache/tinkerpop/blob/master/gremlin-server/conf/gremlin-server-neo4j.yaml
>>> ).
>>> 
>>> Via a controller service we can then connect either Neo4J or Janusgraph
>>> or any other graph DB.
>>> Otherwise we would have to build a processor for each Graph DB.
>>> We don't do that in NiFi for RDBMS either. There we have an ExecuteSQL
>>> or PutSQL and specify via the controller service what we want to connect to.
>>> 
>>> What do you think, Mike?
>>> 
>>> Best Regards,
>>> Uwe
>>> 
>>>> On 06.10.2018 at 00:15, Mike Thomsen wrote:
>>>> Uwe and Matt,
>>>> 
>>>> Now that we're dipping our toes into Neo4J and Cypher, any thoughts on
>>>> this?
>>>> 
>>>> https://github.com/opencypher/cypher-for-gremlin
>>>> 
>>>> I'm wondering if we shouldn't work with mans2singh to take the Neo4J
>>>> work and push it further into having a client API that can let us inject a
>>>> service that uses that or one that uses Neo4J's drivers.
>>>> 
>>>> Mike
>>>> 
>>>> On Mon, May 14, 2018 at 7:13 AM Otto Fowler <ot...@gmail.com> wrote:
>>>> 
>>>>> The wiki discussion should list these and other points of concern and
>>>>> should document the extent to which
>>>>> they are to be addressed.
>>>>> 
>>>>> 
>>>>> On May 12, 2018 at 12:37:59, Uwe@Moosheimer.com (uwe@moosheimer.com)
>>>>> wrote:
>>>>> 
>>>>> Matt,
>>>>> 
>>>>> You have some interesting ideas that I really like.
>>>>> GraphReaders and GraphWriters would be interesting. When I started
>>>>> writing a graph processor with my idea, the concept was not yet
>>>>> implemented in NiFi.
>>>>> I don't find GraphML and GraphSON so tingly because they contain e.g.
>>>>> the Vertex/Edge IDs and serve as import and export format to my
>>>>> knowledge (correct me if I'm wrong).
>>>>> 
>>>>> A ConvertRecordToGraph processor is a good approach, the only question
>>>>> is from which format we can convert?
>>>>> 
>>>>> I also think to make a graph processor a bit general we would have to
>>>>> provide a query as input which provides the correct vertex from which
>>>>> the graph should be extended.
>>>>> Maybe like your suggestion with a gremlin query or a small gremlin
>>> script.
>>>>> 
>>>>> If a vertex is found a new edge and a new vertex are added.
>>>>> It asks how we transmit the individual attributes to the edge and
>>> vertex
>>>>> as well as the labels of the edge and vertex? Possibly with NiFi
>>>>> attributes?
>>>>> 
>>>>> I have some headaches about the complexity.
>>>>> A small example:
>>>>> Imagine we have a set from a CSV file.
>>>>> The columns are Set ID, Token1, Token2, Token3...
>>>>> ID, Token1,Token2,Token3,Token4,Token5
>>>>> 123, Mary, had, a, little, lamp
>>>>> 
>>>>> I want to create a vertex with ID 123 (if not exists). Then I want to
>>>>> check for each token if a vertex exists in the graph database (search
>>>>> for vertex with label "Token" and attribute "name"="Mary"). If the
>>>>> vertex does not exist, the vertex has to be created.
>>>>> Since I want to save e.g. Wikipedia to my graph I want to avoid the
>>>>> supernode problem for the token vertices. I create a few distribution
>>>>> vertices for each vertex that belongs to a token. If there is a vertex
>>>>> for Token1(Mary) then I don't want to make the edge from this vertex to
>>>>> my vertex with the ID 123, but from one of the distribution vertices.
>>>>> If the vertex for the token does not exist, the distribution vertices
>>>>> have also to be created ... and so on...
>>>>> 
>>>>> Even with this very simple example it seems to become difficult with a
>>>>> universal processor.
>>>>> 
>>>>> In any case I think the idea to implement a graph processor in NiFi is
>>> a
>>>>> good one.
>>>>> The more we work on it the more good ideas we get and maybe only I
>>> can't
>>>>> see the forest for the trees.
>>>>> 
>>>>> One question about Titan. To my knowledge, Titan has been dead for a
>>>>> year and a half and Janusgraph is the successor?
>>>>> Titan has become unofficially Datastax Enterprise Graph?!
>>>>> Supporting Titan could become difficult because Titan does not support
>>>>> my knowledge after TinkerPop 3 and is no longer maintained.
>>>>> 
>>>>> I like your idea for a wiki page for more ideas. In the many mails one
>>>>> loses oneself otherwise.
>>>>> 
>>>>> Regards,
>>>>> Kay-Uwe
>>>>> 
>>>>>> Am 12.05.2018 um 16:52 schrieb Matt Burgess:
>>>>>> All,
>>>>>> 
>>>>>> As Joe implied, I'm very happy that we are discussing graph tech in
>>>>>> relation to NiFi! NiFi and Graph theory/tech/analytics are passions of
>>>>>> mine. Mike, the examples you list are great, I would add Titan (and
>>>>>> its fork Janusgraph as Kay-Uwe mentioned) and Azure CosmosDB (these
>>>>>> and others are at [1]). I think there are at least four aspects to
>>>>>> this:
>>>>>> 
>>>>>> 1) Graph query/traversal: This deals with getting data out of a graph
>>>>>> database and into flow file(s) for further processing. Here I agree
>>>>>> with Kay-Uwe that we should consider Apache Tinkerpop as the main
>>>>>> library for graph query/traversal, for a few reasons. The first as
>>>>>> Kay-Uwe said is that there are many adapters for Tinkerpop (TP) to
>>>>>> connect to various databases, from Mike's list I believe ArangoDB is
>>>>>> the only one that does not yet have a TP adapter. The second is
>>>>>> informed by the first, TP is a standard interface and graph traversal
>>>>>> engine with a common DSL in Gremlin. A third is that Gremlin is a
>>>>>> Groovy-based DSL, and Groovy syntax is fairly close to Java 8+ syntax
>>>>>> and you can call Groovy/Gremlin from Java and vice versa. A third is
>>>>>> that Tinkerpop is an Apache TLP with a very active and vibrant
>>>>>> community, so we will be able to reap the benefits of all the graph
>>>>>> goodness they develop moving forward. I think a QueryGraph processor
>>>>>> could be appropriate, perhaps with a GraphDBConnectionPool controller
>>>>>> service or something of the like. Apache DBCP can't do the pooling for
>>>>>> us, but we could implement something similar to that for pooling TP
>>>>>> connections.
>>>>>> 
>>>>>> 2) Graph ingest: This one IMO is the long pole in the tent. Gremlin is
>>>>>> a graph traversal language, and although its API has addVertex() and
>>>>>> addEdge() methods and such, it seems like an inefficient solution,
>>>>>> akin to using individual INSERTs in an RDBMS rather than a
>>>>>> PreparedStatement or a bulk load. Keeping the analogy, bulk loading in
>>>>>> RDBMSs is usually specific to that DB, and the same goes for graphs.
>>>>>> The Titan-based ones have Titan-Hadoop (formerly Faunus), Neo4j has
>>>>>> external tools (not sure if there's a Java API or not) and Cypher,
>>>>>> OrientDB has an ETL pipeline system, etc. If we have a standard Graph
>>>>>> concept, we could have controller services / writers that are
>>>>>> system-specific (see aspect #4).
>>>>>> 
>>>>>> 3) Arbitrary data -> Graph: Converting non-graph data into a graph
>>>>>> almost always takes domain knowledge, which NiFi itself won't have and
>>>>>> will thus have to be provided by the user. We'd need to make it as
>>>>>> simple as possible but also as powerful and flexible as possible in
>>>>>> order to get the most value. We can investigate how each of the
>>>>>> systems in aspect #2 approaches this, and perhaps come up with a good
>>>>>> user experience around it.
>>>>>> 
>>>>>> 4) Organization and implementation: I think we should make sure to
>>>>>> keep the capabilities very loosely coupled in terms of which
>>>>>> modules/NARs/JARs provide which capabilities, to allow for maximum
>>>>>> flexibility and ease of future development. I would prefer an
>>>>>> API/libraries module akin to nifi-hadoop-libraries-nar, which would
>>>>>> only include Apache Tinkerpop and any dependencies needed to do "pure"
>>>>>> graph stuff, so probably no TP adapters except tinkergraph (and/or its
>>>>>> faster fork from ShiftLeft [2]). The reason I say that is so NiFi
>>>>>> components (and even the framework!) could use graphs in a lightweight
>>>>>> manner, without lots of heavy and possibly unnecessary dependencies.
>>>>>> Imagine being able to query your own flows using Gremlin or Cypher! I
>>>>>> also envision an API much like the Record API in NiFi but for graphs,
>>>>>> so we'd have GraphReaders and GraphWriters perhaps, they could convert
>>>>>> from GraphML to GraphSON or Kryo for example, or in conjunction with a
>>>>>> ConvertRecordToGraph processor, could be used to support the
>>>>>> capability in aspect #3 above. I'd also be looking at bringing in
>>>>>> Gremlin to the scripting processors, or having a Gremlin based
>>>>>> scripting bundle as NiFi's graph capabilities mature.
>>>>>> 
>>>>>> You might be able to tell I'm excited about this discussion ;) Should
>>>>>> we get a Wiki page going for ideas, and/or keep it going here, or
>>>>>> something else? I'm all ears for thoughts, questions, and ideas
>>>>>> (especially the ones that might seem crazy!)
>>>>>> 
>>>>>> Regards,
>>>>>> Matt
>>>>>> 
>>>>>> [1] http://tinkerpop.apache.org/providers.html
>>>>>> [2] https://github.com/ShiftLeftSecurity/tinkergraph-gremlin
>>>>>> 
>>>>>> On Sat, May 12, 2018 at 8:02 AM, Uwe@Moosheimer.com <
>>> Uwe@moosheimer.com>
>>>>> wrote:
>>>>>>> Hi Mike,
>>>>>>> 
>>>>>>> graph database support is not quite as easy as it seems.
>>>>>>> Unlike relational databases, graphs have not only defined vertices
>>> and
>>>>> edges (labeled vertices and edges), they are directed or not and might
>>> have
>>>>> attributes at the nodes and edges, too.
>>>>>>> This makes it a bit confusing for a general interface.
>>>>>>> 
>>>>>>> In general, a graph database should always be accessed via TinkerPop
>>> 3
>>>>> (or higher), since every professional graph database supports
>>> TinkerPop.
>>>>>>> TinkerPop is for graph databases what jdbc is for relational
>>> databases.
>>>>>>> 
>>>>>>> I tried to create a general NiFi processor for graph databases myself
>>>>> and then quit.
>>>>>>> Unlike relational databases, graph databases usually have many
>>>>> dependencies.
>>>>>>> You do not simply create a data set but search for a particular
>>> vertex
>>>>> (which may still have certain edges) and create further edges and
>>> vertices
>>>>> at that.
>>>>>>> And the search for the correct node is usually context-related.
>>>>>>> 
>>>>>>> This makes it difficult to do something general for all requirements.
>>>>>>> 
>>>>>>> In any case I am looking forward to your concept and how you want to
>>>>> solve it.
>>>>>>> It's definitely a good idea but hard to solve.
>>>>>>> 
>>>>>>> Btw.: You forgot the most important graph database - Janusgraph.
>>>>>>> 
>>>>>>> Mit freundlichen Grüßen / best regards
>>>>>>> Kay-Uwe Moosheimer
>>>>>>> 
>>>>>>>> Am 12.05.2018 um 13:01 schrieb Mike Thomsen <mikerthomsen@gmail.com
>>>> :
>>>>>>>> 
>>>>>>>> I was wondering if anyone on the dev list had given much thought to
>>>>> graph
>>>>>>>> database support in NiFi. There are a lot of graph databases out
>>> there,
>>>>> and
>>>>>>>> many of them seem to be half-baked or barely supported. Narrowing it
>>>>> down,
>>>>>>>> it looks like the best candidates for a no fuss, decent sized graph
>>>>> that we
>>>>>>>> could build up with NiFi processors would be OrientDB, Neo4J and
>>>>> ArangoDB.
>>>>>>>> The first two are particularly attractive because they offer JDBC
>>>>> drivers
>>>>>>>> which opens the potential to making them even part of the standard
>>>>>>>> JDBC-based processors.
>>>>>>>> 
>>>>>>>> Anyone have any opinions or insights on this issue? I might have to
>>> do
>>>>>>>> OrientDB anyway, but if someone has a good feel for the market and
>>> can
>>>>> make
>>>>>>>> recommendations that would be appreciated.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Mike
>>> 
>>> 
>>>

Re: Graph database support w/ NiFi

Posted by Mike Thomsen <mi...@gmail.com>.

Uwe,

I had a chance to get into JanusGraph w/ Gremlin Server today. Any thoughts
on how you would integrate that? I have some inchoate thoughts about how to
build some sort of Avro-based reader setup so you can do strongly typed
associations sorta like this:

{
  "from": {
    "type": "PersonRecord",
    "value": { ....}
  },
  "to": {
    "type": "PersonRecord",
    "value": { ....}
  },
  "direction": "out",
  "edgeLabel": "emailed"
}

We could mix that with the schema registry APIs to generate Gremlin syntax
to send to the Gremlin server.
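
A rough sketch of that idea, written in Python only for brevity. The function name, the "email" key field, the sample values and the reading of "direction" ("out" means from -> to, "in" reverses it) are all assumptions for illustration, not anything that exists in NiFi. It turns one edge record into a parameterized Gremlin script plus bindings, so the record values travel as bindings rather than being spliced into the script text:

```python
def edge_record_to_gremlin(record, key_field="email"):
    """Return (script, bindings) for one edge record.

    key_field is a hypothetical property that identifies a vertex;
    the original message elides the contents of "value".
    """
    src, dst = record["from"], record["to"]
    # Assumption: "out" means the edge points from -> to, "in" reverses it.
    if record.get("direction", "out") == "in":
        src, dst = dst, src

    # Find the source vertex, remember it, find the target vertex,
    # then add an edge that starts at the remembered source.
    script = (
        "g.V().has(srcLabel, keyField, srcKey).as('src')"
        ".V().has(dstLabel, keyField, dstKey)"
        ".addE(edgeLabel).from('src')"
    )
    bindings = {
        "srcLabel": src["type"],
        "srcKey": src["value"][key_field],
        "dstLabel": dst["type"],
        "dstKey": dst["value"][key_field],
        "keyField": key_field,
        "edgeLabel": record["edgeLabel"],
    }
    return script, bindings


# Sample record in the shape shown above; the "value" contents are made up.
record = {
    "from": {"type": "PersonRecord", "value": {"email": "alice@example.com"}},
    "to": {"type": "PersonRecord", "value": {"email": "bob@example.com"}},
    "direction": "out",
    "edgeLabel": "emailed",
}

script, bindings = edge_record_to_gremlin(record)
print(script)
print(bindings)
```

The script text stays the same for every record and only the bindings change, which should let the server reuse the compiled script. A real version would take the key field from the schema registry rather than a default argument.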

First time I've done this, so please (Matt too) let me know what you think.

Thanks,

Mike


Re: Graph database support w/ NiFi

Posted by Mike Thomsen <mi...@gmail.com>.

We have a Neo4J processor in a PR, but it is very much tied to Neo4J and
Cypher. I was raising the issue that we might want to take that PR and
extend it into an "ExecuteCypherQuery" processor with controller services
that use either cypher for gremlin or the neo4j driver.
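
To make the shape of that concrete, here is a minimal sketch (Python for brevity; a real NiFi version would be Java). Every name in it is hypothetical: CypherClientService stands in for a controller service API, and the two implementations are stubs that only record calls instead of talking to the Neo4J driver or to Cypher for Gremlin. The point is just that the processor logic depends on the service interface and not on a specific backend:

```python
from abc import ABC, abstractmethod


class CypherClientService(ABC):
    """Hypothetical stand-in for a controller service API."""

    @abstractmethod
    def execute(self, query, parameters):
        """Run a Cypher query and return a list of result rows (dicts)."""


class RecordingClient(CypherClientService):
    """Stub behaviour shared by both backends: remember calls, touch no database."""

    backend = "unset"

    def __init__(self):
        self.calls = []

    def execute(self, query, parameters):
        self.calls.append((query, dict(parameters)))
        return [{"backend": self.backend, "query": query}]


class Neo4jDriverClient(RecordingClient):
    """Would wrap the Neo4J driver in a real implementation (stub here)."""

    backend = "neo4j-driver"


class CypherForGremlinClient(RecordingClient):
    """Would translate Cypher for a Gremlin server in a real implementation (stub here)."""

    backend = "cypher-for-gremlin"


def execute_cypher_query(client, query, parameters=None):
    """What the processor would do per flow file: delegate to the injected service."""
    return client.execute(query, parameters or {})


query = "MATCH (p:Person {email: $email}) RETURN p"
for client in (Neo4jDriverClient(), CypherForGremlinClient()):
    rows = execute_cypher_query(client, query, {"email": "alice@example.com"})
    print(rows[0]["backend"])
```

Swapping the backend then means configuring a different service, with no change to the processor itself.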

On Sun, Oct 14, 2018 at 6:03 AM Uwe@Moosheimer.com <Uw...@moosheimer.com>
wrote:

> Mike,
>
> Cypher for Gremlin is a good idea. We can start with it and then later
> allow an alternative so that users can use either Cypher or Gremlin
> directly.
>
> Focusing on Neo4J or Janusgraph or any other single product is, in my
> opinion, not the right goal.
> We should have a NiFi Graph processor that supports Tinkerpop. Via the
> Gremlin server we can support all Tinkerpop capable graph databases
> (
> https://github.com/apache/tinkerpop/blob/master/gremlin-server/conf/gremlin-server-neo4j.yaml
> ).
>
> Via a controller service we can then connect either Neo4J or Janusgraph
> or any other graph DB.
> Otherwise we would have to build a processor for each Graph DB.
> We don't do that in NiFi for RDBMS either. There we have ExecuteSQL or
> PutSQL and tell the controller service what we want to connect to.
>
> What do you think, Mike?
>
> Best Regards,
> Uwe
>
> On 06.10.2018 at 00:15, Mike Thomsen wrote:
> > Uwe and Matt,
> >
> > Now that we're dipping our toes into Neo4J and Cypher, any thoughts on
> > this?
> >
> > https://github.com/opencypher/cypher-for-gremlin
> >
> > I'm wondering if we shouldn't work with mans2singh to take the Neo4J work
> > and push it further into having a client API that can let us inject a
> > service that uses that or one that uses Neo4J's drivers.
> >
> > Mike
> >
> > On Mon, May 14, 2018 at 7:13 AM Otto Fowler <ot...@gmail.com> wrote:
> >
> >> The wiki discussion should list these and other points of concern and
> >> should document the extent to which
> >> they are to be addressed.
> >>
> >>
> >> On May 12, 2018 at 12:37:59, Uwe@Moosheimer.com (uwe@moosheimer.com)
> >> wrote:
> >>
> >> Matt,
> >>
> >> You have some interesting ideas that I really like.
> >> GraphReaders and GraphWriters would be interesting. When I started
> >> writing a graph processor with my idea, the concept was not yet
> >> implemented in NiFi.
> >> I don't find GraphML and GraphSON so appealing because they contain e.g.
> >> the vertex/edge IDs and, to my knowledge, serve as import and export
> >> formats (correct me if I'm wrong).
> >>
> >> A ConvertRecordToGraph processor is a good approach; the only question
> >> is which formats we can convert from.
> >>
> >> I also think that, to make a graph processor reasonably general, we
> >> would have to provide a query as input that returns the vertex from
> >> which the graph should be extended.
> >> Maybe, as you suggested, a Gremlin query or a small Gremlin script.
> >>
> >> If a vertex is found, a new edge and a new vertex are added.
> >> The question is how we pass the individual attributes to the edge and
> >> vertex, as well as the labels of the edge and vertex. Possibly with NiFi
> >> attributes?
> >>
> >> I have some concerns about the complexity.
> >> A small example:
> >> Imagine we have a record from a CSV file.
> >> The columns are record ID, Token1, Token2, Token3...
> >> ID, Token1,Token2,Token3,Token4,Token5
> >> 123, Mary, had, a, little, lamb
> >>
> >> I want to create a vertex with ID 123 (if it does not exist). Then I
> >> want to check for each token whether a vertex exists in the graph
> >> database (search for a vertex with label "Token" and attribute
> >> "name"="Mary"). If the vertex does not exist, it has to be created.
> >> Since I want to save e.g. Wikipedia to my graph, I want to avoid the
> >> supernode problem for the token vertices. I create a few distribution
> >> vertices for each vertex that belongs to a token. If there is a vertex
> >> for Token1 (Mary), then I don't want to make the edge from this vertex
> >> to my vertex with the ID 123, but from one of the distribution vertices.
> >> If the vertex for the token does not exist, the distribution vertices
> >> also have to be created ... and so on...
> >>
> >> Even with this very simple example it seems to become difficult with a
> >> universal processor.
> >>
> >> In any case I think the idea to implement a graph processor in NiFi is a
> >> good one.
> >> The more we work on it the more good ideas we get, and maybe it is only
> >> me who can't see the forest for the trees.
> >>
> >> One question about Titan. To my knowledge, Titan has been dead for a
> >> year and a half and Janusgraph is the successor?
> >> Titan has become unofficially Datastax Enterprise Graph?!
> >> Supporting Titan could become difficult because, to my knowledge, Titan
> >> does not support TinkerPop 3 and is no longer maintained.
> >>
> >> I like your idea of a wiki page for more ideas. Otherwise it is easy to
> >> get lost in the many mails.
> >>
> >> Regards,
> >> Kay-Uwe
> >>
> >> On 12.05.2018 at 16:52, Matt Burgess wrote:
> >>> All,
> >>>
> >>> As Joe implied, I'm very happy that we are discussing graph tech in
> >>> relation to NiFi! NiFi and Graph theory/tech/analytics are passions of
> >>> mine. Mike, the examples you list are great, I would add Titan (and
> >>> its fork Janusgraph as Kay-Uwe mentioned) and Azure CosmosDB (these
> >>> and others are at [1]). I think there are at least four aspects to
> >>> this:
> >>>
> >>> 1) Graph query/traversal: This deals with getting data out of a graph
> >>> database and into flow file(s) for further processing. Here I agree
> >>> with Kay-Uwe that we should consider Apache Tinkerpop as the main
> >>> library for graph query/traversal, for a few reasons. The first as
> >>> Kay-Uwe said is that there are many adapters for Tinkerpop (TP) to
> >>> connect to various databases, from Mike's list I believe ArangoDB is
> >>> the only one that does not yet have a TP adapter. The second is
> >>> informed by the first, TP is a standard interface and graph traversal
> >>> engine with a common DSL in Gremlin. A third is that Gremlin is a
> >>> Groovy-based DSL, and Groovy syntax is fairly close to Java 8+ syntax
> >>> and you can call Groovy/Gremlin from Java and vice versa. A fourth is
> >>> that Tinkerpop is an Apache TLP with a very active and vibrant
> >>> community, so we will be able to reap the benefits of all the graph
> >>> goodness they develop moving forward. I think a QueryGraph processor
> >>> could be appropriate, perhaps with a GraphDBConnectionPool controller
> >>> service or something of the like. Apache DBCP can't do the pooling for
> >>> us, but we could implement something similar to that for pooling TP
> >>> connections.
> >>>
> >>> 2) Graph ingest: This one IMO is the long pole in the tent. Gremlin is
> >>> a graph traversal language, and although its API has addVertex() and
> >>> addEdge() methods and such, it seems like an inefficient solution,
> >>> akin to using individual INSERTs in an RDBMS rather than a
> >>> PreparedStatement or a bulk load. Keeping the analogy, bulk loading in
> >>> RDBMSs is usually specific to that DB, and the same goes for graphs.
> >>> The Titan-based ones have Titan-Hadoop (formerly Faunus), Neo4j has
> >>> external tools (not sure if there's a Java API or not) and Cypher,
> >>> OrientDB has an ETL pipeline system, etc. If we have a standard Graph
> >>> concept, we could have controller services / writers that are
> >>> system-specific (see aspect #4).
> >>>
> >>> 3) Arbitrary data -> Graph: Converting non-graph data into a graph
> >>> almost always takes domain knowledge, which NiFi itself won't have and
> >>> will thus have to be provided by the user. We'd need to make it as
> >>> simple as possible but also as powerful and flexible as possible in
> >>> order to get the most value. We can investigate how each of the
> >>> systems in aspect #2 approaches this, and perhaps come up with a good
> >>> user experience around it.
> >>>
> >>> 4) Organization and implementation: I think we should make sure to
> >>> keep the capabilities very loosely coupled in terms of which
> >>> modules/NARs/JARs provide which capabilities, to allow for maximum
> >>> flexibility and ease of future development. I would prefer an
> >>> API/libraries module akin to nifi-hadoop-libraries-nar, which would
> >>> only include Apache Tinkerpop and any dependencies needed to do "pure"
> >>> graph stuff, so probably no TP adapters except tinkergraph (and/or its
> >>> faster fork from ShiftLeft [2]). The reason I say that is so NiFi
> >>> components (and even the framework!) could use graphs in a lightweight
> >>> manner, without lots of heavy and possibly unnecessary dependencies.
> >>> Imagine being able to query your own flows using Gremlin or Cypher! I
> >>> also envision an API much like the Record API in NiFi but for graphs,
> >>> so we'd have GraphReaders and GraphWriters perhaps, they could convert
> >>> from GraphML to GraphSON or Kryo for example, or in conjunction with a
> >>> ConvertRecordToGraph processor, could be used to support the
> >>> capability in aspect #3 above. I'd also be looking at bringing in
> >>> Gremlin to the scripting processors, or having a Gremlin based
> >>> scripting bundle as NiFi's graph capabilities mature.
> >>>
> >>> You might be able to tell I'm excited about this discussion ;) Should
> >>> we get a Wiki page going for ideas, and/or keep it going here, or
> >>> something else? I'm all ears for thoughts, questions, and ideas
> >>> (especially the ones that might seem crazy!)
> >>>
> >>> Regards,
> >>> Matt
> >>>
> >>> [1] http://tinkerpop.apache.org/providers.html
> >>> [2] https://github.com/ShiftLeftSecurity/tinkergraph-gremlin
> >>>

Re: Graph database support w/ NiFi

Posted by "Uwe@Moosheimer.com" <Uw...@Moosheimer.com>.

Mike,

Cypher for Gremlin is a good idea. We can start with it and then later
allow an alternative so that users can use either Cypher or Gremlin
directly.

Focusing specifically on Neo4J or Janusgraph or xyz is, in my opinion, not
productive.
We should have a NiFi Graph processor that supports Tinkerpop. Via the
Gremlin server we can support all Tinkerpop capable graph databases
(https://github.com/apache/tinkerpop/blob/master/gremlin-server/conf/gremlin-server-neo4j.yaml).

Via a controller service we can then connect either Neo4J or Janusgraph
or any other graph DB.
Otherwise we would have to build a processor for each Graph DB.
We don't do that in NiFi for RDBMSs either. There we have ExecuteSQL
or PutSQL and specify via the controller service what we want to connect to.
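
A minimal sketch of that split, assuming hypothetical names throughout
(GraphClientService, InMemoryGraphClient and execute_graph_query are
illustrative only and are not NiFi or TinkerPop APIs). The processor-side
function depends only on the service contract, so the backend can be swapped
without touching it:

```python
from abc import ABC, abstractmethod


class GraphClientService(ABC):
    """Hypothetical controller-service contract: run a query, return rows."""

    @abstractmethod
    def execute(self, query, parameters):
        ...


class InMemoryGraphClient(GraphClientService):
    """Stand-in backend for this sketch; a real one would wrap a database driver."""

    def __init__(self, name):
        self.name = name
        self.log = []  # queries received, kept so the behaviour is observable

    def execute(self, query, parameters):
        self.log.append((query, dict(parameters)))
        return [{"backend": self.name, "query": query}]


def execute_graph_query(client, query, parameters=None):
    """Processor-side logic: knows the service contract, not the database."""
    return client.execute(query, parameters or {})


# The same processor logic runs against either configured service.
neo4j_like = InMemoryGraphClient("neo4j")
janus_like = InMemoryGraphClient("janusgraph")
for backend in (neo4j_like, janus_like):
    print(execute_graph_query(backend, "g.V().count()"))
```

Whether the query text is Cypher or Gremlin would then be a property of the
configured service rather than of the processor.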

What do you think, Mike?

Best Regards,
Uwe

On 06.10.2018 at 00:15, Mike Thomsen wrote:
> Uwe and Matt,
>
> Now that we're dipping our toes into Neo4J and Cypher, any thoughts on this?
>
> https://github.com/opencypher/cypher-for-gremlin
>
> I'm wondering if we shouldn't work with mans2singh to take the Neo4J work
> and push it further into having a client API that can let us inject a
> service that uses that or one that uses Neo4J's drivers.
>
> Mike

Re: Graph database support w/ NiFi

Posted by Mike Thomsen <mi...@gmail.com>.

Uwe and Matt,

Now that we're dipping our toes into Neo4J and Cypher, any thoughts on this?

https://github.com/opencypher/cypher-for-gremlin

I'm wondering if we shouldn't work with mans2singh to take the Neo4J work
and push it further into having a client API that can let us inject a
service that uses that or one that uses Neo4J's drivers.

Mike

On Mon, May 14, 2018 at 7:13 AM Otto Fowler <ot...@gmail.com> wrote:

> The wiki discussion should list these and other points of concern and
> should document the extent to which
> they are to be addressed.

Re: Graph database support w/ NiFi

Posted by Otto Fowler <ot...@gmail.com>.

The wiki discussion should list these and other points of concern and
should document the extent to which they are to be addressed.


On May 12, 2018 at 12:37:59, Uwe@Moosheimer.com (uwe@moosheimer.com) wrote:

Matt,

You have some interesting ideas that I really like.
GraphReaders and GraphWriters would be interesting. When I started
writing my own graph processor, that concept was not yet
implemented in NiFi.
I don't find GraphML and GraphSON that appealing because they contain, e.g.,
the vertex/edge IDs and, to my knowledge, serve as import and export
formats (correct me if I'm wrong).

A ConvertRecordToGraph processor is a good approach; the only question
is which formats we can convert from.

I also think that, to make a graph processor reasonably general, we would have to
provide a query as input that returns the correct vertex from which
the graph should be extended.
Maybe, as you suggested, a Gremlin query or a small Gremlin script.

If a vertex is found, a new edge and a new vertex are added.
The question is how we pass the individual attributes, as well as the
labels, to the edge and vertex. Possibly with NiFi
attributes?
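
One way this could look, as a rough sketch: the graph is a plain dict, the
"query" is a Python predicate standing in for a Gremlin traversal, and the
attribute prefixes "graph.vertex." and "graph.edge." are assumptions made up
for this illustration, not existing NiFi conventions.

```python
def split_attributes(attributes, vertex_prefix="graph.vertex.", edge_prefix="graph.edge."):
    """Split flow-file style attributes into vertex and edge properties by prefix."""
    vertex = {k[len(vertex_prefix):]: v for k, v in attributes.items()
              if k.startswith(vertex_prefix)}
    edge = {k[len(edge_prefix):]: v for k, v in attributes.items()
            if k.startswith(edge_prefix)}
    return vertex, edge


def extend_graph(graph, match, attributes):
    """Find the anchor vertex via match; if found, attach one new vertex with one edge."""
    anchor = next((v for v in graph["vertices"] if match(v)), None)
    if anchor is None:
        return None  # nothing matched; a processor might route this to "failure"
    vertex_props, edge_props = split_attributes(attributes)
    # Assumption for the sketch: vertex IDs are simply list positions.
    new_vertex = {"id": len(graph["vertices"]), **vertex_props}
    graph["vertices"].append(new_vertex)
    graph["edges"].append({"out": anchor["id"], "in": new_vertex["id"], **edge_props})
    return new_vertex


graph = {"vertices": [{"id": 0, "label": "Token", "name": "Mary"}], "edges": []}
attributes = {
    "graph.vertex.label": "Token",
    "graph.vertex.name": "had",
    "graph.edge.label": "followed_by",
    "filename": "rhyme.csv",  # unrelated attribute, ignored by the split
}
created = extend_graph(graph, lambda v: v.get("name") == "Mary", attributes)
print(created, graph["edges"])
```

Attributes without one of the two prefixes are left alone, so ordinary
flow-file attributes such as the filename do not leak into the graph.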

The complexity gives me some headaches.
A small example:
Imagine we have a set from a CSV file.
The columns are Set ID, Token1, Token2, Token3...
ID, Token1,Token2,Token3,Token4,Token5
123, Mary, had, a, little, lamp

I want to create a vertex with ID 123 (if it does not exist yet). Then I want to
check for each token whether a vertex exists in the graph database (search
for a vertex with label "Token" and attribute "name"="Mary"). If the
vertex does not exist, it has to be created.
Since I want to save e.g. Wikipedia to my graph, I want to avoid the
supernode problem for the token vertices. I create a few distribution
vertices for each vertex that belongs to a token. If there is a vertex
for Token1 (Mary), then I don't want to make the edge from this vertex to
my vertex with the ID 123, but from one of the distribution vertices.
If the vertex for the token does not exist, the distribution vertices
also have to be created ... and so on...
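
The steps above can be sketched roughly as follows. Everything here is an
assumption for illustration: the in-memory Graph class stands in for a real
graph database, FANOUT (three distribution vertices per token) is an arbitrary
choice, and the edge labels "distributes" and "occurs_in" are invented.

```python
import zlib

FANOUT = 3  # assumed number of distribution vertices per token


class Graph:
    """Tiny in-memory stand-in for a graph database."""

    def __init__(self):
        self.vertices = {}  # key -> property dict
        self.edges = set()  # (out_key, label, in_key); a set keeps re-ingest idempotent

    def get_or_create(self, key, **props):
        """Return the vertex stored under key, creating it only if it is missing."""
        return self.vertices.setdefault(key, props)

    def add_edge(self, out_key, label, in_key):
        self.edges.add((out_key, label, in_key))


def ingest_row(graph, set_id, tokens):
    set_key = ("Set", set_id)
    graph.get_or_create(set_key, label="Set")
    # Stable hash, so the same set always lands on the same distribution slot.
    slot = zlib.crc32(str(set_id).encode("utf-8")) % FANOUT
    for token in tokens:
        token_key = ("Token", token)
        if token_key not in graph.vertices:
            # A new token vertex gets its distribution vertices right away.
            graph.get_or_create(token_key, label="Token", name=token)
            for i in range(FANOUT):
                dist_key = ("Dist", token, i)
                graph.get_or_create(dist_key, label="Dist")
                graph.add_edge(token_key, "distributes", dist_key)
        # The edge starts at a distribution vertex, never at the token vertex.
        graph.add_edge(("Dist", token, slot), "occurs_in", set_key)


graph = Graph()
ingest_row(graph, 123, ["Mary", "had", "a", "little", "lamp"])
print(len(graph.vertices), len(graph.edges))  # prints "21 20"
```

Even this toy version needs a lookup, a conditional create and a fan-out step
per token, which is the kind of context-dependent logic that is hard to
express through generic processor properties.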

Even with this very simple example it seems to become difficult with a
universal processor.

In any case I think the idea of implementing a graph processor in NiFi is a
good one.
The more we work on it, the more good ideas we will get, and maybe it is just
me who can't see the forest for the trees.

One question about Titan. To my knowledge, Titan has been dead for a
year and a half and Janusgraph is the successor?
Titan has unofficially become DataStax Enterprise Graph?!
Supporting Titan could become difficult because, to my knowledge, Titan does
not support anything after TinkerPop 3 and is no longer maintained.

I like your idea of a wiki page for collecting more ideas. Otherwise it is
easy to get lost in the many emails.

Regards,
Kay-Uwe

On 12.05.2018 at 16:52, Matt Burgess wrote:
> All,
>
> As Joe implied, I'm very happy that we are discussing graph tech in
> relation to NiFi! NiFi and Graph theory/tech/analytics are passions of
> mine. Mike, the examples you list are great, I would add Titan (and
> its fork Janusgraph as Kay-Uwe mentioned) and Azure CosmosDB (these
> and others are at [1]). I think there are at least four aspects to
> this:
>
> 1) Graph query/traversal: This deals with getting data out of a graph
> database and into flow file(s) for further processing. Here I agree
> with Kay-Uwe that we should consider Apache Tinkerpop as the main
> library for graph query/traversal, for a few reasons. The first, as
> Kay-Uwe said, is that there are many adapters for Tinkerpop (TP) to
> connect to various databases; from Mike's list I believe ArangoDB is
> the only one that does not yet have a TP adapter. The second is
> informed by the first: TP is a standard interface and graph traversal
> engine with a common DSL in Gremlin. A third is that Gremlin is a
> Groovy-based DSL, and Groovy syntax is fairly close to Java 8+ syntax
> and you can call Groovy/Gremlin from Java and vice versa. A fourth is
> that Tinkerpop is an Apache TLP with a very active and vibrant
> community, so we will be able to reap the benefits of all the graph
> goodness they develop moving forward. I think a QueryGraph processor
> could be appropriate, perhaps with a GraphDBConnectionPool controller
> service or something of the like. Apache DBCP can't do the pooling for
> us, but we could implement something similar to that for pooling TP
> connections.
>
> 2) Graph ingest: This one IMO is the long pole in the tent. Gremlin is
> a graph traversal language, and although its API has addVertex() and
> addEdge() methods and such, it seems like an inefficient solution,
> akin to using individual INSERTs in an RDBMS rather than a
> PreparedStatement or a bulk load. Keeping the analogy, bulk loading in
> RDBMSs is usually specific to that DB, and the same goes for graphs.
> The Titan-based ones have Titan-Hadoop (formerly Faunus), Neo4j has
> external tools (not sure if there's a Java API or not) and Cypher,
> OrientDB has an ETL pipeline system, etc. If we have a standard Graph
> concept, we could have controller services / writers that are
> system-specific (see aspect #4).
>
> 3) Arbitrary data -> Graph: Converting non-graph data into a graph
> almost always takes domain knowledge, which NiFi itself won't have and
> will thus have to be provided by the user. We'd need to make it as
> simple as possible but also as powerful and flexible as possible in
> order to get the most value. We can investigate how each of the
> systems in aspect #2 approaches this, and perhaps come up with a good
> user experience around it.
>
> 4) Organization and implementation: I think we should make sure to
> keep the capabilities very loosely coupled in terms of which
> modules/NARs/JARs provide which capabilities, to allow for maximum
> flexibility and ease of future development. I would prefer an
> API/libraries module akin to nifi-hadoop-libraries-nar, which would
> only include Apache Tinkerpop and any dependencies needed to do "pure"
> graph stuff, so probably no TP adapters except tinkergraph (and/or its
> faster fork from ShiftLeft [2]). The reason I say that is so NiFi
> components (and even the framework!) could use graphs in a lightweight
> manner, without lots of heavy and possibly unnecessary dependencies.
> Imagine being able to query your own flows using Gremlin or Cypher! I
> also envision an API much like the Record API in NiFi but for graphs,
> so we'd have GraphReaders and GraphWriters perhaps, they could convert
> from GraphML to GraphSON or Kryo for example, or in conjunction with a
> ConvertRecordToGraph processor, could be used to support the
> capability in aspect #3 above. I'd also be looking at bringing in
> Gremlin to the scripting processors, or having a Gremlin based
> scripting bundle as NiFi's graph capabilities mature.
>
> You might be able to tell I'm excited about this discussion ;) Should
> we get a Wiki page going for ideas, and/or keep it going here, or
> something else? I'm all ears for thoughts, questions, and ideas
> (especially the ones that might seem crazy!)
>
> Regards,
> Matt
>
> [1] http://tinkerpop.apache.org/providers.html
> [2] https://github.com/ShiftLeftSecurity/tinkergraph-gremlin
>
> On Sat, May 12, 2018 at 8:02 AM, Uwe@Moosheimer.com <Uw...@moosheimer.com> wrote:
>> Hi Mike,
>>
>> graph database support is not quite as easy as it seems.
>> Unlike relational databases, graphs have not only defined vertices and edges (labeled vertices and edges), they are directed or not and might have attributes at the nodes and edges, too.
>>
>> This makes it a bit confusing for a general interface.
>>
>> In general, a graph database should always be accessed via TinkerPop 3 (or higher), since every professional graph database supports TinkerPop.
>> TinkerPop is for graph databases what jdbc is for relational databases.
>>
>> I tried to create a general NiFi processor for graph databases myself and then quit.
>> Unlike relational databases, graph databases usually have many dependencies.
>>
>> You do not simply create a data set but search for a particular vertex (which may still have certain edges) and create further edges and vertices at that.
>> And the search for the correct node is usually context-related.
>>
>> This makes it difficult to do something general for all requirements.
>>
>> In any case I am looking forward to your concept and how you want to solve it.
>> It's definitely a good idea but hard to solve.
>>
>> Btw.: You forgot the most important graph database - Janusgraph.
>>
>> Mit freundlichen Grüßen / best regards
>> Kay-Uwe Moosheimer
>>
>>> On 12.05.2018 at 13:01, Mike Thomsen <mi...@gmail.com> wrote:
>>>
>>> I was wondering if anyone on the dev list had given much thought to graph
>>> database support in NiFi. There are a lot of graph databases out there, and
>>> many of them seem to be half-baked or barely supported. Narrowing it down,
>>> it looks like the best candidates for a no fuss, decent sized graph that we
>>> could build up with NiFi processors would be OrientDB, Neo4J and ArangoDB.
>>> The first two are particularly attractive because they offer JDBC drivers
>>> which opens the potential to making them even part of the standard
>>> JDBC-based processors.
>>>
>>> Anyone have any opinions or insights on this issue? I might have to do
>>> OrientDB anyway, but if someone has a good feel for the market and can make
>>> recommendations that would be appreciated.
>>>
>>> Thanks,
>>>
>>> Mike

Re: Graph database support w/ NiFi

Posted by "Uwe@Moosheimer.com" <Uw...@Moosheimer.com>.

Matt,

You have some interesting ideas that I really like.
GraphReaders and GraphWriters would be interesting. When I started
writing my own graph processor, the reader/writer concept was not yet
implemented in NiFi.
I am not that keen on GraphML and GraphSON, because they contain e.g.
the vertex/edge IDs and, to my knowledge, serve as import and export
formats (correct me if I'm wrong).

A ConvertRecordToGraph processor is a good approach; the only question
is which formats we can convert from.

I also think that, to make a graph processor reasonably general, we
would have to take a query as input that returns the vertex from which
the graph should be extended.
Maybe, as you suggest, a Gremlin query or a small Gremlin script.

If a vertex is found, a new edge and a new vertex are added.
The question is how we pass the individual attributes, as well as the
labels, to the new edge and vertex. Possibly with NiFi attributes?

The complexity gives me some headaches.
A small example:
Imagine we have a record from a CSV file.
The columns are a set ID followed by Token1, Token2, Token3, ...
ID, Token1, Token2, Token3, Token4, Token5
123, Mary, had, a, little, lamb

I want to create a vertex with ID 123 (if it does not exist yet). Then,
for each token, I want to check whether a vertex exists in the graph
database (search for a vertex with label "Token" and attribute
"name"="Mary"). If the vertex does not exist, it has to be created.
Since I want to store e.g. Wikipedia in my graph, I want to avoid the
supernode problem for the token vertices, so I create a few distribution
vertices for each token vertex. If there is a vertex for Token1 (Mary),
then I don't want to create the edge from that vertex to my vertex with
ID 123, but from one of its distribution vertices.
If the vertex for the token does not exist, its distribution vertices
also have to be created ... and so on...
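
The steps in this example can be sketched in a few lines of Python. This
is only a toy illustration, not NiFi or TinkerPop code: TinyGraph is a
made-up in-memory stand-in for a graph database, and the labels "Set",
"Distribution", "distributes" and "occurs_in" as well as the fanout of 3
are assumptions chosen for the sketch.

```python
import random


class TinyGraph:
    """Toy in-memory property graph; a stand-in for a real graph database."""

    def __init__(self):
        self.vertices = {}   # vertex id -> {"label": ..., "props": {...}}
        self.edges = []      # (from_id, edge_label, to_id)
        self._index = {}     # (label, key, value) -> vertex id
        self._next_id = 0

    def get_or_create(self, label, key, value):
        """Return (vertex_id, created) for the vertex with this label and key=value."""
        found = self._index.get((label, key, value))
        if found is not None:
            return found, False
        vid = self._next_id
        self._next_id += 1
        self.vertices[vid] = {"label": label, "props": {key: value}}
        self._index[(label, key, value)] = vid
        return vid, True

    def add_edge(self, from_id, label, to_id):
        self.edges.append((from_id, label, to_id))

    def out(self, from_id, label):
        """Target ids of all outgoing edges of from_id with the given label."""
        return [t for f, l, t in self.edges if f == from_id and l == label]


def ingest_row(graph, set_id, tokens, fanout=3, rng=random):
    """Link one CSV row to its tokens via distribution vertices."""
    set_v, _ = graph.get_or_create("Set", "id", set_id)
    for token in tokens:
        token_v, created = graph.get_or_create("Token", "name", token)
        if created:
            # A new token vertex gets its distribution vertices right away.
            for slot in range(fanout):
                dist_v, _ = graph.get_or_create(
                    "Distribution", "key", f"{token}#{slot}")
                graph.add_edge(token_v, "distributes", dist_v)
        # The edge starts at a distribution vertex, never at the token itself.
        dist_v = rng.choice(graph.out(token_v, "distributes"))
        graph.add_edge(dist_v, "occurs_in", set_v)
    return set_v


graph = TinyGraph()
ingest_row(graph, "123", ["Mary", "had", "a", "little", "lamb"])
ingest_row(graph, "124", ["Mary", "had", "a", "dog"])
```

Even in this reduced form, the lookup key, the fanout and the edge
direction are all decisions the user has to supply, which is the
difficulty described above.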

Even with this very simple example, a universal processor seems to
become difficult.

In any case, I think the idea of implementing a graph processor in NiFi
is a good one.
The more we work on it, the more good ideas we will get, and maybe it is
just me who can't see the forest for the trees.

One question about Titan: to my knowledge, Titan has been dead for a
year and a half and Janusgraph is the successor?
Titan has unofficially become Datastax Enterprise Graph?!
Supporting Titan could become difficult because, to my knowledge, it
does not support anything beyond TinkerPop 3 and is no longer maintained.

I like your idea of a wiki page for collecting more ideas. Otherwise
one gets lost in the many mails.

Regards,
Kay-Uwe

On 12.05.2018 at 16:52, Matt Burgess wrote:
> All,
>
> As Joe implied, I'm very happy that we are discussing graph tech in
> relation to NiFi! NiFi and Graph theory/tech/analytics are passions of
> mine. Mike, the examples you list are great, I would add Titan (and
> its fork Janusgraph as Kay-Uwe mentioned) and Azure CosmosDB (these
> and others are at [1]). I think there are at least four aspects to
> this:
>
> 1) Graph query/traversal: This deals with getting data out of a graph
> database and into flow file(s) for further processing. Here I agree
> with Kay-Uwe that we should consider Apache Tinkerpop as the main
> library for graph query/traversal, for a few reasons. The first as
> Kay-Uwe said is that there are many adapters for Tinkerpop (TP) to
> connect to various databases, from Mike's list I believe ArangoDB is
> the only one that does not yet have a TP adapter. The second is
> informed by the first, TP is a standard interface and graph traversal
> engine with a common DSL in Gremlin. A third is that Gremlin is a
> Groovy-based DSL, and Groovy syntax is fairly close to Java 8+ syntax
> and you can call Groovy/Gremlin from Java and vice versa. A fourth is
> that Tinkerpop is an Apache TLP with a very active and vibrant
> community, so we will be able to reap the benefits of all the graph
> goodness they develop moving forward.  I think a QueryGraph processor
> could be appropriate, perhaps with a GraphDBConnectionPool controller
> service or something of the like. Apache DBCP can't do the pooling for
> us, but we could implement something similar to that for pooling TP
> connections.
>
> 2) Graph ingest: This one IMO is the long pole in the tent. Gremlin is
> a graph traversal language, and although its API has addVertex() and
> addEdge() methods and such, it seems like an inefficient solution,
> akin to using individual INSERTs in an RDBMS rather than a
> PreparedStatement or a bulk load. Keeping the analogy, bulk loading in
> RDBMSs is usually specific to that DB, and the same goes for graphs.
> The Titan-based ones have Titan-Hadoop (formerly Faunus), Neo4j has
> external tools (not sure if there's a Java API or not) and Cypher,
> OrientDB has an ETL pipeline system, etc.  If we have a standard Graph
> concept, we could have controller services / writers that are
> system-specific (see aspect #4).
>
> 3) Arbitrary data -> Graph: Converting non-graph data into a graph
> almost always takes domain knowledge, which NiFi itself won't have and
> will thus have to be provided by the user. We'd need to make it as
> simple as possible but also as powerful and flexible as possible in
> order to get the most value. We can investigate how each of the
> systems in aspect #2 approaches this, and perhaps come up with a good
> user experience around it.
>
> 4) Organization and implementation:  I think we should make sure to
> keep the capabilities very loosely coupled in terms of which
> modules/NARs/JARs provide which capabilities, to allow for maximum
> flexibility and ease of future development.  I would prefer an
> API/libraries module akin to nifi-hadoop-libraries-nar, which would
> only include Apache Tinkerpop and any dependencies needed to do "pure"
> graph stuff, so probably no TP adapters except tinkergraph (and/or its
> faster fork from ShiftLeft [2]). The reason I say that is so NiFi
> components (and even the framework!) could use graphs in a lightweight
> manner, without lots of heavy and possibly unnecessary dependencies.
> Imagine being able to query your own flows using Gremlin or Cypher!  I
> also envision an API much like the Record API in NiFi but for graphs,
> so we'd have GraphReaders and GraphWriters perhaps, they could convert
> from GraphML to GraphSON or Kryo for example, or in conjunction with a
> ConvertRecordToGraph processor, could be used to support the
> capability in aspect #3 above.  I'd also be looking at bringing in
> Gremlin to the scripting processors, or having a Gremlin based
> scripting bundle as NiFi's graph capabilities mature.
>
> You might be able to tell I'm excited about this discussion ;)  Should
> we get a Wiki page going for ideas, and/or keep it going here, or
> something else?  I'm all ears for thoughts, questions, and ideas
> (especially the ones that might seem crazy!)
>
> Regards,
> Matt
>
> [1] http://tinkerpop.apache.org/providers.html
> [2] https://github.com/ShiftLeftSecurity/tinkergraph-gremlin
>
> On Sat, May 12, 2018 at 8:02 AM, Uwe@Moosheimer.com <Uw...@moosheimer.com> wrote:
>> Hi Mike,
>>
>> graph database support is not quite as easy as it seems.
>> Unlike relational databases, graphs have not only defined vertices and edges (labeled vertices and edges), they are directed or not and might have attributes at the nodes and edges, too.
>>
>> This makes it a bit confusing for a general interface.
>>
>> In general, a graph database should always be accessed via TinkerPop 3 (or higher), since every professional graph database supports TinkerPop.
>> TinkerPop is for graph databases what jdbc is for relational databases.
>>
>> I tried to create a general NiFi processor for graph databases myself and then quit.
>> Unlike relational databases, graph databases usually have many dependencies.
>>
>> You do not simply create a data set but search for a particular vertex (which may still have certain edges) and create further edges and vertices at that.
>> And the search for the correct node is usually context-related.
>>
>> This makes it difficult to do something general for all requirements.
>>
>> In any case I am looking forward to your concept and how you want to solve it.
>> It's definitely a good idea but hard to solve.
>>
>> Btw.: You forgot the most important graph database - Janusgraph.
>>
>> Mit freundlichen Grüßen / best regards
>> Kay-Uwe Moosheimer
>>
>>> On 12.05.2018 at 13:01, Mike Thomsen <mi...@gmail.com> wrote:
>>>
>>> I was wondering if anyone on the dev list had given much thought to graph
>>> database support in NiFi. There are a lot of graph databases out there, and
>>> many of them seem to be half-baked or barely supported. Narrowing it down,
>>> it looks like the best candidates for a no fuss, decent sized graph that we
>>> could build up with NiFi processors would be OrientDB, Neo4J and ArangoDB.
>>> The first two are particularly attractive because they offer JDBC drivers
>>> which opens the potential to making them even part of the standard
>>> JDBC-based processors.
>>>
>>> Anyone have any opinions or insights on this issue? I might have to do
>>> OrientDB anyway, but if someone has a good feel for the market and can make
>>> recommendations that would be appreciated.
>>>
>>> Thanks,
>>>
>>> Mike

Re: Graph database support w/ NiFi

Posted by Matt Burgess <ma...@apache.org>.

All,

As Joe implied, I'm very happy that we are discussing graph tech in
relation to NiFi! NiFi and Graph theory/tech/analytics are passions of
mine. Mike, the examples you list are great, I would add Titan (and
its fork Janusgraph as Kay-Uwe mentioned) and Azure CosmosDB (these
and others are at [1]). I think there are at least four aspects to
this:

1) Graph query/traversal: This deals with getting data out of a graph
database and into flow file(s) for further processing. Here I agree
with Kay-Uwe that we should consider Apache Tinkerpop as the main
library for graph query/traversal, for a few reasons. The first, as
Kay-Uwe said, is that there are many adapters for Tinkerpop (TP) to
connect to various databases; from Mike's list I believe ArangoDB is
the only one that does not yet have a TP adapter. The second follows
from the first: TP is a standard interface and graph traversal
engine with a common DSL in Gremlin. A third is that Gremlin is a
Groovy-based DSL, Groovy syntax is fairly close to Java 8+ syntax,
and you can call Groovy/Gremlin from Java and vice versa. A fourth is
that Tinkerpop is an Apache TLP with a very active and vibrant
community, so we will be able to reap the benefits of all the graph
goodness they develop moving forward.  I think a QueryGraph processor
could be appropriate, perhaps with a GraphDBConnectionPool controller
service or something of the like. Apache DBCP can't do the pooling for
us, but we could implement something similar to that for pooling TP
connections.
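
To make the pooling idea concrete, here is a minimal sketch in Python
rather than NiFi Java. It assumes nothing about the real TinkerPop
client: GraphConnectionPool, FakeConnection and its submit() method are
hypothetical names used only for illustration, and a real controller
service would also need validation and eviction of broken connections.

```python
import queue
from contextlib import contextmanager


class GraphConnectionPool:
    """Fixed-size pool; `factory` is any callable that opens one connection."""

    def __init__(self, factory, size=4):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller failed.
            self._idle.put(conn)


class FakeConnection:
    """Placeholder for a real graph client; it only records submitted queries."""

    def __init__(self):
        self.submitted = []

    def submit(self, query):
        self.submitted.append(query)
        return []


pool = GraphConnectionPool(FakeConnection, size=2)
with pool.connection() as conn:
    conn.submit("g.V().hasLabel('Token').limit(10)")
```

A processor would borrow a connection per trigger and return it when
done, the same way the JDBC processors use a DBCP-backed service.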

2) Graph ingest: This one IMO is the long pole in the tent. Gremlin is
a graph traversal language, and although its API has addVertex() and
addEdge() methods and such, it seems like an inefficient solution,
akin to using individual INSERTs in an RDBMS rather than a
PreparedStatement or a bulk load. Keeping the analogy, bulk loading in
RDBMSs is usually specific to that DB, and the same goes for graphs.
The Titan-based ones have Titan-Hadoop (formerly Faunus), Neo4j has
external tools (not sure if there's a Java API or not) and Cypher,
OrientDB has an ETL pipeline system, etc.  If we have a standard Graph
concept, we could have controller services / writers that are
system-specific (see aspect #4).
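
As a rough illustration of the difference between one call per element
and a batched load, the sketch below buffers mutations and hands them to
a backend in groups. It is written in Python with made-up names:
BatchingGraphWriter and the send_batch callable are assumptions, and the
backend here is just a list that records each round trip.

```python
class BatchingGraphWriter:
    """Buffers vertex/edge mutations and sends them to the backend in batches."""

    def __init__(self, send_batch, batch_size=500):
        self._send_batch = send_batch   # hypothetical callable(list_of_mutations)
        self._batch_size = batch_size
        self._buffer = []

    def add_vertex(self, label, **props):
        self._enqueue(("vertex", label, props))

    def add_edge(self, label, from_key, to_key, **props):
        self._enqueue(("edge", label, from_key, to_key, props))

    def _enqueue(self, mutation):
        self._buffer.append(mutation)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as one batch."""
        if self._buffer:
            self._send_batch(self._buffer)
            self._buffer = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Only flush the remainder if the block finished without an error.
        if exc_type is None:
            self.flush()
        return False


round_trips = []   # each entry is one batch "sent" to the backend
with BatchingGraphWriter(round_trips.append, batch_size=100) as writer:
    for i in range(250):
        writer.add_vertex("Token", name=f"t{i}")
```

Here 250 vertices cost three round trips instead of 250. A
system-specific writer would replace send_batch with whatever bulk
mechanism the target database offers.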

3) Arbitrary data -> Graph: Converting non-graph data into a graph
almost always takes domain knowledge, which NiFi itself won't have and
will thus have to be provided by the user. We'd need to make it as
simple as possible but also as powerful and flexible as possible in
order to get the most value. We can investigate how each of the
systems in aspect #2 approaches this, and perhaps come up with a good
user experience around it.
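
One way to picture the user-supplied domain knowledge is a small mapping
that says which record fields become vertices and which edges connect
them. The sketch below is Python and purely illustrative: the mapping
layout, the field names and record_to_graph() are assumptions, not an
existing NiFi or TinkerPop format.

```python
# User-provided mapping: which columns become vertices, and how they connect.
mapping = {
    "vertices": [
        {"label": "Person", "key": "name", "column": "customer"},
        {"label": "Product", "key": "sku", "column": "item"},
    ],
    "edges": [
        # source/target are positions in the "vertices" list above.
        {"label": "bought", "source": 0, "target": 1, "props": ["quantity"]},
    ],
}


def record_to_graph(record, mapping):
    """Turn one flat record into (vertices, edges) according to the mapping."""
    vertices = [
        (v["label"], v["key"], record[v["column"]])
        for v in mapping["vertices"]
    ]
    edges = [
        (
            vertices[e["source"]],
            e["label"],
            vertices[e["target"]],
            {p: record[p] for p in e.get("props", [])},
        )
        for e in mapping["edges"]
    ]
    return vertices, edges


vertices, edges = record_to_graph(
    {"customer": "alice", "item": "A-17", "quantity": 2}, mapping)
```

The conversion code stays generic; everything specific to the domain
lives in the mapping the user writes.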

4) Organization and implementation:  I think we should make sure to
keep the capabilities very loosely coupled in terms of which
modules/NARs/JARs provide which capabilities, to allow for maximum
flexibility and ease of future development.  I would prefer an
API/libraries module akin to nifi-hadoop-libraries-nar, which would
only include Apache Tinkerpop and any dependencies needed to do "pure"
graph stuff, so probably no TP adapters except tinkergraph (and/or its
faster fork from ShiftLeft [2]). The reason I say that is so NiFi
components (and even the framework!) could use graphs in a lightweight
manner, without lots of heavy and possibly unnecessary dependencies.
Imagine being able to query your own flows using Gremlin or Cypher!  I
also envision an API much like the Record API in NiFi but for graphs,
so we'd perhaps have GraphReaders and GraphWriters. They could convert
from GraphML to GraphSON or Kryo, for example, or, in conjunction with
a ConvertRecordToGraph processor, support the capability in aspect #3
above.  I'd also be looking at bringing Gremlin into the scripting
processors, or having a Gremlin-based scripting bundle, as NiFi's
graph capabilities mature.
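
A GraphWriter abstraction along these lines could look roughly like the
Python sketch below. The class names and the write() signature are
assumptions made for illustration, not a proposed NiFi API. The GraphML
output covers only bare nodes and directed edges, and EdgeListWriter is
an invented second format to show that writers are interchangeable.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class GraphWriter(ABC):
    """Serializes a graph given as vertex ids and (source, target) pairs."""

    @abstractmethod
    def write(self, vertices, edges) -> str:
        ...


class GraphMLWriter(GraphWriter):
    NS = "http://graphml.graphdrawing.org/xmlns"

    def write(self, vertices, edges):
        root = ET.Element("graphml", xmlns=self.NS)
        graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
        for vid in vertices:
            ET.SubElement(graph, "node", id=str(vid))
        for i, (src, dst) in enumerate(edges):
            ET.SubElement(graph, "edge", id=f"e{i}",
                          source=str(src), target=str(dst))
        return ET.tostring(root, encoding="unicode")


class EdgeListWriter(GraphWriter):
    """Plain 'source target' lines, one edge per line."""

    def write(self, vertices, edges):
        return "\n".join(f"{src} {dst}" for src, dst in edges)


vertices = ["n0", "n1", "n2"]
edges = [("n0", "n1"), ("n1", "n2")]
graphml_doc = GraphMLWriter().write(vertices, edges)
edge_list = EdgeListWriter().write(vertices, edges)
```

A processor configured with a writer would not need to know which
serialization it produces, which is the same decoupling the Record API
gives the record processors.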

You might be able to tell I'm excited about this discussion ;)  Should
we get a Wiki page going for ideas, and/or keep it going here, or
something else?  I'm all ears for thoughts, questions, and ideas
(especially the ones that might seem crazy!)

Regards,
Matt

[1] http://tinkerpop.apache.org/providers.html
[2] https://github.com/ShiftLeftSecurity/tinkergraph-gremlin

On Sat, May 12, 2018 at 8:02 AM, Uwe@Moosheimer.com <Uw...@moosheimer.com> wrote:
> Hi Mike,
>
> graph database support is not quite as easy as it seems.
> Unlike relational databases, graphs have not only defined vertices and edges (labeled vertices and edges), they are directed or not and might have attributes at the nodes and edges, too.
>
> This makes it a bit confusing for a general interface.
>
> In general, a graph database should always be accessed via TinkerPop 3 (or higher), since every professional graph database supports TinkerPop.
> TinkerPop is for graph databases what jdbc is for relational databases.
>
> I tried to create a general NiFi processor for graph databases myself and then quit.
> Unlike relational databases, graph databases usually have many dependencies.
>
> You do not simply create a data set but search for a particular vertex (which may still have certain edges) and create further edges and vertices at that.
> And the search for the correct node is usually context-related.
>
> This makes it difficult to do something general for all requirements.
>
> In any case I am looking forward to your concept and how you want to solve it.
> It's definitely a good idea but hard to solve.
>
> Btw.: You forgot the most important graph database - Janusgraph.
>
> Mit freundlichen Grüßen / best regards
> Kay-Uwe Moosheimer
>
>> On 12.05.2018 at 13:01, Mike Thomsen <mi...@gmail.com> wrote:
>>
>> I was wondering if anyone on the dev list had given much thought to graph
>> database support in NiFi. There are a lot of graph databases out there, and
>> many of them seem to be half-baked or barely supported. Narrowing it down,
>> it looks like the best candidates for a no fuss, decent sized graph that we
>> could build up with NiFi processors would be OrientDB, Neo4J and ArangoDB.
>> The first two are particularly attractive because they offer JDBC drivers
>> which opens the potential to making them even part of the standard
>> JDBC-based processors.
>>
>> Anyone have any opinions or insights on this issue? I might have to do
>> OrientDB anyway, but if someone has a good feel for the market and can make
>> recommendations that would be appreciated.
>>
>> Thanks,
>>
>> Mike
>

Re: Graph database support w/ NiFi

Posted by "Uwe@Moosheimer.com" <Uw...@Moosheimer.com>.

Hi Mike,

Graph database support is not quite as easy as it seems.
Unlike relational data, a graph does not just have defined (labeled) vertices and edges; the edges may be directed or undirected, and both vertices and edges can carry attributes.

This makes it a bit confusing for a general interface. 

In general, a graph database should always be accessed via TinkerPop 3 (or higher), since every professional graph database supports TinkerPop.
TinkerPop is for graph databases what JDBC is for relational databases.

I tried to create a general NiFi processor for graph databases myself and then quit.
Unlike relational databases, graph databases usually have many dependencies.

You do not simply create a data set; you search for a particular vertex (which may already have certain edges) and attach further edges and vertices to it.
And the search for the correct vertex usually depends on the context.

This makes it difficult to do something general for all requirements.

In any case I am looking forward to your concept and how you want to solve it.
It's definitely a good idea but hard to solve.

Btw.: You forgot the most important graph database - Janusgraph.

Mit freundlichen Grüßen / best regards
Kay-Uwe Moosheimer

> On 12.05.2018 at 13:01, Mike Thomsen <mi...@gmail.com> wrote:
> 
> I was wondering if anyone on the dev list had given much thought to graph
> database support in NiFi. There are a lot of graph databases out there, and
> many of them seem to be half-baked or barely supported. Narrowing it down,
> it looks like the best candidates for a no fuss, decent sized graph that we
> could build up with NiFi processors would be OrientDB, Neo4J and ArangoDB.
> The first two are particularly attractive because they offer JDBC drivers
> which opens the potential to making them even part of the standard
> JDBC-based processors.
> 
> Anyone have any opinions or insights on this issue? I might have to do
> OrientDB anyway, but if someone has a good feel for the market and can make
> recommendations that would be appreciated.
> 
> Thanks,
> 
> Mike