Posted to dev@flume.apache.org by Saikat Kanjilal <sx...@hotmail.com> on 2016/07/01 05:16:45 UTC

RE: [Discuss graph source/sink design proposal]

So I've started the coding effort on this; here are some details:
1) I've cloned the hbase sink for now and am refactoring all of that code to work with neo4j as a start.
2) I'm only focusing on creating a sink that will perform basic CRUD streaming operations into neo4j.
3) I've sent an email to the neo4j guys to figure out details around building a streaming architecture with the neo4j kernel.
4) In the meantime, how would you guys like to review the code? I've cloned the flume repo and created a branch called flume-2035 where I will work. Should I put all the code in bitbucket and send out periodic reviews? This is going to be a sizeable effort.
5) How should we think about Cypher-related workflows as they relate to the streaming data coming in? For the full flavor of Cypher, go here: https://neo4j.com/developer/cypher-query-language/
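
To make items 2 and 5 a bit more concrete, here is a minimal sketch of how a CRUD operation on an incoming event might map to a parameterized Cypher statement. Everything named here is an assumption for illustration only: the class CypherCrudSketch, the Op enum, the Event label and the id/props parameters are hypothetical and not part of Flume or the neo4j driver. It uses the $name parameter syntax; older neo4j releases also accepted {name}.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical mapping from a CRUD operation to a parameterized Cypher statement. */
class CypherCrudSketch {
    enum Op { CREATE, READ, UPDATE, DELETE }

    /** Statement text plus the parameters to bind; values are never spliced into the text. */
    static final class Statement {
        final String text;
        final Map<String, Object> params;

        Statement(String text, Map<String, Object> params) {
            this.text = text;
            this.params = Collections.unmodifiableMap(new HashMap<>(params));
        }
    }

    /** Builds the statement for one event; the "Event" label and "id" key are assumptions. */
    static Statement forEvent(Op op, String id, Map<String, Object> props) {
        Map<String, Object> params = new HashMap<>();
        params.put("id", id);
        switch (op) {
            case CREATE:
                params.put("props", props);
                return new Statement("CREATE (n:Event {id: $id}) SET n += $props", params);
            case READ:
                return new Statement("MATCH (n:Event {id: $id}) RETURN n", params);
            case UPDATE:
                params.put("props", props);
                return new Statement("MATCH (n:Event {id: $id}) SET n += $props", params);
            case DELETE:
                return new Statement("MATCH (n:Event {id: $id}) DETACH DELETE n", params);
            default:
                throw new IllegalArgumentException("unsupported op: " + op);
        }
    }
}
```

Keeping the statement text fixed per operation and passing the event data only as parameters means the same four statements get reused for every event, whatever the payload contains.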

Would love to get some discussion going on 2-5.
Thanks

> From: mpercy@apache.org
> Date: Wed, 29 Jun 2016 17:24:16 -0700
> Subject: Re: [Discuss graph source/sink design proposal]
> To: dev@flume.apache.org
> 
> Hmm, maybe a different Kudu project? Not sure.
> 
> Anyway, this type of "changelog" thing would require support in the DB for
> streaming its write-ahead log or something. For example, we don't support
> that in Apache Kudu (incubating) -- maybe someday.
> 
> Regarding Flume, I usually think it's useful to distinguish between a
> source and a sink. They are typically written as separate classes and they
> represent different interfaces at the Flume Java API level.
> 
> So, how would one write a streaming database source? That really depends on
> the database and the APIs it provides for that.
> 
> Mike
> 
> On Tue, Jun 28, 2016 at 8:30 AM, Saikat Kanjilal <sx...@hotmail.com>
> wrote:
> 
> > :) I'm using Kudu at work at the moment to troubleshoot some Tomcat
> > issues. Regarding where to keep the source code, I would say for now
> > let's go with the plugin approach and revisit the "where does the code live"
> > conversation later. One thing I do want to discuss is that the plugin will
> > act as a source or a sink depending on configuration, so if the plugin acts
> > as a source we need a mechanism (like a daemon in syslog) to stream changes
> > in real time from a graphdb into flume. I was wondering if there are any past
> > approaches around this that I can follow; I may need to dig into the neo4j
> > kernel to see where we can inject something like this.
> > Thoughts on that?
> >
> > > From: mpercy@apache.org
> > > Date: Tue, 28 Jun 2016 00:27:45 -0700
> > > Subject: Re: [Discuss graph source/sink design proposal]
> > > To: dev@flume.apache.org
> > >
> > > Hi Saikat,
> > > Please see my thoughts inline. This is how I think about this stuff;
> > > others may think about it differently.
> > >
> > > On Mon, Jun 27, 2016 at 8:45 PM, Saikat Kanjilal <sx...@hotmail.com>
> > > wrote:
> > >
> > > > Exactly right, I'm proposing we create a graph sink for flume while
> > > > keeping the flume core intact.
> > >
> > >
> > > As you are probably aware, sources and sinks don't have to be part of the
> > > main Apache Flume source tree to be used with Flume. The plugins.d
> > > mechanism described in [1] makes building and integrating separate
> > > plugins into Flume an easy thing to do at deployment time.
> > >
> > > In another project I work on, Apache Kudu (incubating), we have a Flume
> > > Kudu sink committed in the main source tree [2]. We may at some point
> > > propose to move it into the Flume source tree, but for now (for testing
> > > and API stability reasons) it's easier to keep it in the Kudu source tree.
> > >
> > > Likewise, you could implement a Flume Neo4J sink and post it up on GitHub
> > > (or maybe in the Neo4J tree?). Donating it to the Apache Flume project
> > > once it's in decent shape may make sense at some point, especially if the
> > > dependencies are easy to share and integrate into the Flume project.
> > > However, I wouldn't say that it's a foregone conclusion that it really
> > > needs to be part of the Flume source tree. Assuming you need the sink,
> > > and are going to implement it anyway, then maybe we can defer the
> > > discussion of whether to include it in the Flume source tree until later.
> > > One of the things I try to keep in mind when integrating new plugin code
> > > is whether the project will be able to support the maintenance burden of
> > > the new code.
> > >
> > > > In reading from a graph db we need a mechanism to stream data from the
> > > > graph store into flume.
> > >
> > > Yes, I'd say it could potentially make sense to create a Flume Neo4J
> > > source as well. I think the same logic as above would still apply.
> > >
> > > Regards,
> > > Mike
> > >
> > > [1] https://flume.apache.org/FlumeUserGuide.html#installing-third-party-plugins
> > > [2] https://github.com/apache/incubator-kudu/tree/master/java/kudu-flume-sink
> >
> >

Re: [Discuss graph source/sink design proposal]

Posted by Saikat Kanjilal <sx...@hotmail.com>.

I wanted to add one other thing: I'll be looking at installing a server extension inside neo4j so that the flume sink can inject streaming data into neo4j, by wrapping the extension's curl calls in Java.

The plugin below will be installed in neo4j as a server extension after which I will start querying neo4j.

https://github.com/jexp/streaming-cypher

Per its README: just put the jar (streaming-cypher-extension-1.7.M03.jar) into neo4j-server/plugins and add the setting given in the README to the conf/neo4j-server.properties file.

I would also love to hear some thoughts from y'all on how we should review this code (e.g. as a patch, as partial pull requests, etc.).

Thanks

Re: [Discuss graph source/sink design proposal]

Posted by Saikat Kanjilal <sx...@hotmail.com>.

I suppose I'll need to do the same with neo4j.
Thanks

Sent from my iPhone

> On Jul 13, 2016, at 6:21 PM, Mike Percy <mp...@apache.org> wrote:
> 
> For the Flume-Kafka integration we start up Kafka mini clusters in the unit
> tests. It depends on the server. The project doesn't have any permanent
> infrastructure in place with long running servers.
> 
> Mike

Re: [Discuss graph source/sink design proposal]

Posted by Mike Percy <mp...@apache.org>.

For the Flume-Kafka integration we start up Kafka mini clusters in the unit
tests. It depends on the server. The project doesn't have any permanent
infrastructure in place with long running servers.
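
As a rough illustration of that pattern (the test itself starts a throwaway server, exercises the code, and shuts the server down), here is a minimal sketch. It is not how the Flume-Kafka tests are written and it does not use neo4j: the JDK's built-in HttpServer stands in for the embedded server, and InProcessTestServer, the /ingest path and the post helper are hypothetical names. A neo4j-backed test would substitute an embedded or locally launched neo4j instance, which is not shown here.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical stand-in for an embedded server that a test starts and stops itself. */
class InProcessTestServer implements AutoCloseable {
    private final HttpServer server;
    private final List<String> received = new CopyOnWriteArrayList<>();

    InProcessTestServer() throws IOException {
        // Port 0 asks the OS for any free port, so parallel test runs do not collide.
        server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/ingest", exchange -> {
            received.add(readAll(exchange.getRequestBody())); // record what the client sent
            exchange.sendResponseHeaders(204, -1);            // 204 with no response body
            exchange.close();
        });
        server.start();
    }

    int port() { return server.getAddress().getPort(); }

    List<String> received() { return received; }

    @Override
    public void close() { server.stop(0); }

    private static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Code under test in this sketch: posts one payload and returns the HTTP status. */
    static int post(int port, String payload) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://127.0.0.1:" + port + "/ingest").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }
}
```

A test would open the server in a try-with-resources block, call the code under test against port(), and then assert on received(). This checks correctness only; it says nothing about real-world sink performance.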

Mike

On Wed, Jul 13, 2016 at 5:37 PM, Saikat Kanjilal <sx...@hotmail.com>
wrote:

> Mike et al,
>
> Out of curiosity, how do committers usually run integration tests when
> doing flume sink development? At some point I will have the graph sink
> talking to neo4j, and I would really rather not have to test everything
> locally, as the performance of testing locally would not really reflect
> the actual sink performance. Any ideas on how to get past this? I'm not
> there yet, but in a few weeks I'll need to start perf/integration testing.
>
>
> Thanks in advance.
>
>
> ________________________________
> From: Saikat Kanjilal <sx...@hotmail.com>
> Sent: Saturday, July 9, 2016 8:16 AM
> To: dev@flume.apache.org
> Subject: Re: [Discuss graph source/sink design proposal]
>
> Mike et al,
>
> To clarify again, I'm starting with the hbase sink and modifying it to
> match the graph use case. This is probably why you saw the hbase stuff
> still left over. In a nutshell, the design will look like the following:
>
>
> flume->neo4j (sink workflow)
>
> We batch events up from flume, use the neo4j bolt driver to convert the
> batch of events into Cypher statements, and then send the data in bulk
> into neo4j. One open question here is how many events go in a batch and
> whether this should be dynamically configurable.
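
A minimal sketch of the batching step described above, under stated assumptions: a plain Queue<String> stands in for the Flume channel, the BulkWriter interface stands in for whatever actually talks to neo4j (for instance a bolt driver session), and the Event label and the UNWIND-based bulk statement are illustrative choices rather than anything settled in this thread. BatchingSketch and drainOnce are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch: drain up to batchSize events and hand them over as one bulk write. */
class BatchingSketch {
    /** Stand-in for the component that sends a statement and its rows to the graph store. */
    interface BulkWriter {
        void write(String cypher, List<Map<String, Object>> rows);
    }

    /** One statement for the whole batch; each row becomes one node (assumed data model). */
    static final String BULK_CREATE = "UNWIND $rows AS row CREATE (e:Event) SET e += row";

    private final int batchSize;
    private final BulkWriter writer;

    BatchingSketch(int batchSize, BulkWriter writer) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.writer = writer;
    }

    /** Takes at most batchSize events from the queue and returns how many were written. */
    int drainOnce(Queue<String> channel) {
        List<Map<String, Object>> rows = new ArrayList<>(batchSize);
        String body;
        while (rows.size() < batchSize && (body = channel.poll()) != null) {
            rows.add(Collections.<String, Object>singletonMap("body", body));
        }
        if (!rows.isEmpty()) {
            writer.write(BULK_CREATE, rows); // one round trip per batch, not one per event
        }
        return rows.size();
    }
}
```

With a batch size of 2 and three queued events, successive drainOnce calls write 2, then 1, then nothing. In a real Flume sink the takes would happen inside a channel transaction that is committed only after the write succeeds, and the batch size would normally come from the agent configuration, which is one way to address the "dynamically configurable" question.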
>
>
> neo4j->flume (source workflow)
>
> We add event listeners inside neo4j and then send data back into flume
> through these listeners, although here we'd need to be really careful about
> sending every single event. A batching strategy might also make sense here,
> but it takes away the concept of real-time updates.
>
>
> More later as I make more progress. Also, your criteria for acceptance of
> this sink are no different from those for accepting contributions to any
> other open source project. I guess I'd also like to know if there's interest
> from the community in connecting flume with neo4j, as that would generate
> more feedback on the design.
>
> Here's a blurb on the new neo4j drivers for Java and other languages:
>
> https://neo4j.com/blog/neo4j-3-0-language-drivers/
>
> Thanks
>
> ________________________________
> From: Mike Percy <mp...@apache.org>
> Sent: Friday, July 8, 2016 6:22 PM
> To: dev@flume.apache.org
> Subject: Re: [Discuss graph source/sink design proposal]
>
> Hi Saikat, please see my responses inline.
>
> On Thu, Jul 7, 2016 at 8:50 PM, Saikat Kanjilal <sx...@hotmail.com>
> wrote:
>
> > Ok moved the code to here:
> > https://bitbucket.org/skanjila/flume-ng-graph-sink
>
> It looks like mostly still HBaseSink code right now, just with a different
> package name. I only looked at the Async one and that's what I found.
>
> > Also I am exploring using the https://github.com/neo4j/neo4j-java-driver
> > using the bolt protocol to connect to neo4j to stream events
> >
>
> I don't know anything about Neo4J personally. Unfortunately I don't have
> time to really participate in development of this new sink using technology
> I have no use for, myself. Maybe there are others on this list that have
> the time and interest to help.
>
> > Looking forward to getting feedback on this effort as y'all have time.
> >
>
> I apologize for not having the time to provide much guidance beyond the
> capabilities of Flume itself.
>
> In the future, as a committer on Flume, I would personally consider merging
> Neo4J support into the Flume source tree if the following conditions were
> met:
>
> 1. Strong feedback from others that this connector is desired by multiple
> members of the community
> 2. An implementation that is well designed, tested, and production-grade
> 3. A likely long-term maintainer (maybe that is you?)
>
> The reason I hesitate to add more integrations into the core is that if
> this breaks, and someone is using it, we will have to fix it. If someone
> asks a question on the mailing lists, we will have to attempt to answer it.
>
> Regards,
> Mike
>
>
> > From: Saikat Kanjilal <sx...@hotmail.com>
> > Sent: Thursday, July 7, 2016 9:31 AM
> > To: dev@flume.apache.org
> > Subject: Re: [Discuss graph source/sink design proposal]
> >
> > Would it be ok to use bitbucket instead?  I have indeed extended
> > AbstractSink to build the graph sink, and I will depend on flume-ng-core
> > in my pom as well.
> >
> > Thanks, and feel free to respond on the Cypher discussion as well as the
> > other items I mentioned earlier.
> >
> >
> > ________________________________
> > From: Mike Percy <mp...@apache.org>
> > Sent: Monday, July 4, 2016 12:03 PM
> > To: dev@flume.apache.org
> > Subject: Re: [Discuss graph source/sink design proposal]
> >
> > Hi Saikat,
> > I recommend you use GitHub. Private branches in ASF repos are only
> > available to committers.
> >
> > Regarding forking Flume, you should not need to do that. Just depend on
> > flume-ng-core in your pom and extend AbstractSink. Maven will pull in
> > your deps.
> >
> > I'm out of town for the next few days but I'll try to respond in more
> > detail to your design notes when I'm back in town.
> >
> > Mike
> >
> > Sent from my iPhone
> >
> > > On Jul 4, 2016, at 6:59 AM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> > >
> > > Hari/Mike et al,
> > >
> > > I need a place to put interim checkins related to this work. Is it
> > > possible to get write privileges to a private branch so that I can
> > > commit my code at intermediate junctures? I can also put it in bitbucket,
> > > but would rather not have to create yet another place for the code to
> > > live if it'll eventually end up in the flume repo.
> > >
> > >
> > > Thanks in advance
> > >
> > >
> > > ________________________________
> > > From: Saikat Kanjilal <sx...@hotmail.com>
> > > Sent: Thursday, June 30, 2016 10:16 PM
> > > To: dev@flume.apache.org
> > > Subject: RE: [Discuss graph source/sink design proposal]
> > >
> > > So I've started the coding effort on this; here are some details:
> > > 1) I've cloned the hbase sink for now and am refactoring all of that
> > > code to work with neo4j as a start.
> > > 2) I'm only focusing on creating a sink that will perform basic CRUD
> > > streaming operations into neo4j.
> > > 3) I've sent an email to the neo4j guys to figure out details around
> > > building a streaming architecture with the neo4j kernel.
> > > 4) In the meantime, how would you guys like to review the code? I've
> > > cloned the flume repo and have created a branch called flume-2035 where
> > > I will work. Should I put all the code in bitbucket and send out
> > > periodic reviews? This is going to be a sizeable effort.
> > > 5) How should we think about Cypher-related workflows as they relate to
> > > the streaming data coming in? For the full flavor of Cypher, see
> > > https://neo4j.com/developer/cypher-query-language/
> > >
> > > Would love to get some discussion going on 2-5.
> > > Thanks
> > >
> > >> From: mpercy@apache.org
> > >> Date: Wed, 29 Jun 2016 17:24:16 -0700
> > >> Subject: Re: [Discuss graph source/sink design proposal]
> > >> To: dev@flume.apache.org
> > >>
> > >> Hmm, maybe a different Kudu project? Not sure.
> > >>
> > >> Anyway, this type of "changelog" thing would require support in the DB
> > for
> > >> streaming its write-ahead log or something. For example, we don't
> > support
> > >> that in Apache Kudu (incubating) -- maybe someday.
> > >>
> > >> Regarding Flume, I usually think it's useful to distinguish between a
> > >> source and a sink. They are typically written as separate classes and
> > they
> > >> represent different interfaces at the Flume Java API level.
> > >>
> > >> So, how would one write a streaming database source? That really
> > depends on
> > >> the database and the APIs it provides for that.
> > >>
> > >> Mike
> > >>
> > >> On Tue, Jun 28, 2016 at 8:30 AM, Saikat Kanjilal <sxk1969@hotmail.com
> >
> > >> wrote:
> > >>
> > >>> :) I'm using Kudu at work at the moment to troubleshoot some Tomcat
> > >>> issues,  regarding the where to keep the source code I would say for
> > now
> > >>> lets go with the plugin approach and revisit the "where does the code
> > live"
> > >>> conversation later.  One thing I do want to discuss is that the
> plugin
> > will
> > >>> act as a source or a sink depending on configuration, so if the
> plugin
> > acts
> > >>> as a source we need a mechanism (like a daemon in syslog) to stream
> > changes
> > >>> real time from a graphdb into flume, I was wondering if there are any
> > past
> > >>> approaches around this that I can follow, I may need to dig into the
> > neo4j
> > >>> kernel to see where we can inject something like this.
> > >>> Thoughts on that?
> > >>>
> > >>>> From: mpercy@apache.org
> > >>>> Date: Tue, 28 Jun 2016 00:27:45 -0700
> > >>>> Subject: Re: [Discuss graph source/sink design proposal]
> > >>>> To: dev@flume.apache.org
> > >>>>
> > >>>> Hi Saikat,
> > >>>> Please see my thoughts inline. This is how I think about this stuff;
> > >>> others
> > >>>> may think about it differently.
> > >>>>
> > >>>> On Mon, Jun 27, 2016 at 8:45 PM, Saikat Kanjilal <
> sxk1969@hotmail.com
> > >
> > >>>> wrote:
> > >>>>
> > >>>>> Exactly right, I'm proposing we create a graph sink for flume while
> > >>>>> keeping the flume core intact.
> > >>>>
> > >>>>
> > >>>> As you are probably aware, sources and sinks don't have to be part
> of
> > the
> > >>>> main Apache Flume source tree to be used with Flume. The plugins.d
> > >>>> mechanism described in [1] makes building and integrating separate
> > >>> plugins
> > >>>> into Flume an easy thing to do at deployment time.
> > >>>>
> > >>>> In another project I work on, Apache Kudu (incubating), we have a
> > Flume
> > >>>> Kudu sink committed in the main source tree [2]. We may at some
> point
> > >>>> propose to move it into the Flume source tree, but for now (for
> > testing
> > >>> and
> > >>>> API stability reasons) it's easier to keep it in the Kudu source
> tree.
> > >>>>
> > >>>> Likewise, you could implement a Flume Neo4J sink and post it up on
> > GitHub
> > >>>> (or maybe in the Neo4J tree?). Donating it to the Apache Flume
> project
> > >>> once
> > >>>> it's in decent shape may make sense at some point, especially if the
> > >>>> dependencies are easy to share and integrate into the Flume project.
> > >>>> However, I wouldn't say that it's a foregone conclusion that it
> really
> > >>>> needs to be part of the Flume source tree. Assuming you need the
> sink,
> > >>> and
> > >>>> are going to implement it anyway, then maybe we can defer the
> > discussion
> > >>> of
> > >>>> whether to include it in the Flume source tree until later. One of
> the
> > >>>> things I try to keep in mind when integrating new plugin code is
> > whether
> > >>>> the project will be able to support the maintenance burden of the
> new
> > >>> code.
> > >>>>
> > >>>> In reading from a graph db we need a mechanism to stream data from
> the
> > >>>>> graph store into flume.
> > >>>>
> > >>>> Yes, I'd say it could potentially make sense to create a Flume Neo4J
> > >>> source
> > >>>> as well. I think the same logic as above would still apply.
> > >>>>
> > >>>> Regards,
> > >>>> Mike
> > >>>>
> > >>>> [1]
> > >>>
> >
> https://flume.apache.org/FlumeUserGuide.html#installing-third-party-plugins
> > >>>> [2]
> > >>>
> >
> https://github.com/apache/incubator-kudu/tree/master/java/kudu-flume-sink
> > >
> >
> >
>

Re: [Discuss graph source/sink design proposal]

Posted by Saikat Kanjilal <sx...@hotmail.com>.

Mike et al,

Out of curiosity, how do committers usually run integration tests when doing flume sink development? At some point I will have the graph sink talking to neo4j, and I would rather not test everything locally, since local test performance would not reflect the actual sink performance. Any ideas on how to get past this? I'm not there yet, but in a few weeks I'll need to start perf/integration testing.


Thanks in advance.


________________________________
From: Saikat Kanjilal <sx...@hotmail.com>
Sent: Saturday, July 9, 2016 8:16 AM
To: dev@flume.apache.org
Subject: Re: [Discuss graph source/sink design proposal]

Mike et al,

To clarify again, I'm starting with the hbase sink and modifying it to match the graph use case.  This is probably why you saw the hbase stuff still left over.  In a nutshell, the design will look like the following:


flume->neo4j (sink workflow)

We batch events up from flume, use the neo4j bolt driver to convert the batch of events into Cypher statements, and then send the data in bulk into neo4j. One open question here is how many events go in a batch, and whether this should be dynamically configurable.


neo4j->flume (source workflow)

We add event listeners inside neo4j and then send data back into flume through these listeners. Here we'd need to be careful about sending every single event; a batching strategy might also make sense, but it takes away the concept of real-time updates.


More later as I make more progress. Also, your criteria for accepting this sink are no different from those for accepting contributions to any other open source project. I'd also like to know if there's interest from the community in connecting flume with neo4j, as that would generate more feedback on the design.

Here's a blurb on the new neo4j driver interface for Java and other languages:

https://neo4j.com/blog/neo4j-3-0-language-drivers/





Thanks




________________________________
From: Mike Percy <mp...@apache.org>
Sent: Friday, July 8, 2016 6:22 PM
To: dev@flume.apache.org
Subject: Re: [Discuss graph source/sink design proposal]

Hi Saikat, please see my responses inline.

On Thu, Jul 7, 2016 at 8:50 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:

> Ok moved the code to here:
> https://bitbucket.org/skanjila/flume-ng-graph-sink

It looks like mostly still HBaseSink code right now, just with a different
package name. I only looked at the Async one and that's what I found.

> Also I am exploring using the https://github.com/neo4j/neo4j-java-driver
> using the bolt protocol to connect to neo4j to stream events
>

I don't know anything about Neo4J personally. Unfortunately I don't have
time to really participate in development of this new sink using technology
I have no use for, myself. Maybe there are others on this list that have
the time and interest to help.

> Looking forward to getting feedback on this effort as y'all have time.
>

I apologize for not having the time to provide much guidance beyond the
capabilities of Flume itself.

In the future, as a committer on Flume, I would personally consider merging
Neo4J support into the Flume source tree if the following conditions were
met:

1. Strong feedback from others that this connector is desired by multiple
members of the community
2. An implementation that is well designed, tested, and production-grade
3. A likely long-term maintainer (maybe that is you?)

The reason I hesitate to add more integrations into the core is that if
this breaks, and someone is using it, we will have to fix it. If someone
asks a question on the mailing lists, we will have to attempt to answer it.

Regards,
Mike


From: Saikat Kanjilal <sx...@hotmail.com>
> Sent: Thursday, July 7, 2016 9:31 AM
> To: dev@flume.apache.org
> Subject: Re: [Discuss graph source/sink design proposal]
>
> Would it be ok to use bitbucket instead?  I have indeed extended
> AbstractSink to build the graph sink, I will depend on flume-ng-core on my
> pom as well.
>
> Thanks and feel free to respond on the cipher discussion as well as the
> other items I mentioned earlier.
>
>
> ________________________________
> From: Mike Percy <mp...@apache.org>
> Sent: Monday, July 4, 2016 12:03 PM
> To: dev@flume.apache.org
> Subject: Re: [Discuss graph source/sink design proposal]
>
> Hi Saikat,
> I recommend you use GitHub. Private branches in ASF repos are only
> available to committers.
>
> Regarding forking Flume, you should not need to do that. Just depend on
> flume-ng-core in your pom and extend AbstractSink. Maven will pull in your
> deps.
>
> I'm out of town for the next few days but I'll try to respond in more
> detail to your design notes when I'm back in town.
>
> Mike
>
> Sent from my iPhone
>
> > On Jul 4, 2016, at 6:59 AM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> >
> > Hari/Mike et al,
> >
> > I need a place to put interim checkins related to this work, is it
> possible to get write privileges into a private branch so that I can commit
> my code at intermediate junctures, I can also put it in bitbucket but would
> rather not have to create yet another place for the code to live if it'll
> eventually end up in the flume repo.
> >
> >
> > Thanks in advance
> >
> >
> > ________________________________
> > From: Saikat Kanjilal <sx...@hotmail.com>
> > Sent: Thursday, June 30, 2016 10:16 PM
> > To: dev@flume.apache.org
> > Subject: RE: [Discuss graph source/sink design proposal]
> >
> > So I've started the coding effort on this; here are some details:
> > 1) I've cloned the hbase sink for now and am refactoring all of that
> > code to work with neo4j as a start.
> > 2) I'm only focusing on creating a sink that will perform basic CRUD
> > streaming operations into neo4j.
> > 3) I've sent an email to the neo4j guys to figure out details around
> > building a streaming architecture with the neo4j kernel.
> > 4) In the meantime, how would you guys like to review the code? I've
> > cloned the flume repo and have created a branch called flume-2035 where
> > I will work. Should I put all the code in bitbucket and send out
> > periodic reviews? This is going to be a sizeable effort.
> > 5) How should we think about Cypher-related workflows as they relate to
> > the streaming data coming in? For the full flavor of Cypher, see
> > https://neo4j.com/developer/cypher-query-language/
> >
> > Would love to get some discussion going on 2-5.
> > Thanks
> >
> >> From: mpercy@apache.org
> >> Date: Wed, 29 Jun 2016 17:24:16 -0700
> >> Subject: Re: [Discuss graph source/sink design proposal]
> >> To: dev@flume.apache.org
> >>
> >> Hmm, maybe a different Kudu project? Not sure.
> >>
> >> Anyway, this type of "changelog" thing would require support in the DB
> for
> >> streaming its write-ahead log or something. For example, we don't
> support
> >> that in Apache Kudu (incubating) -- maybe someday.
> >>
> >> Regarding Flume, I usually think it's useful to distinguish between a
> >> source and a sink. They are typically written as separate classes and
> they
> >> represent different interfaces at the Flume Java API level.
> >>
> >> So, how would one write a streaming database source? That really
> depends on
> >> the database and the APIs it provides for that.
> >>
> >> Mike
> >>
> >> On Tue, Jun 28, 2016 at 8:30 AM, Saikat Kanjilal <sx...@hotmail.com>
> >> wrote:
> >>
> >>> :) I'm using Kudu at work at the moment to troubleshoot some Tomcat
> >>> issues,  regarding the where to keep the source code I would say for
> now
> >>> lets go with the plugin approach and revisit the "where does the code
> live"
> >>> conversation later.  One thing I do want to discuss is that the plugin
> will
> >>> act as a source or a sink depending on configuration, so if the plugin
> acts
> >>> as a source we need a mechanism (like a daemon in syslog) to stream
> changes
> >>> real time from a graphdb into flume, I was wondering if there are any
> past
> >>> approaches around this that I can follow, I may need to dig into the
> neo4j
> >>> kernel to see where we can inject something like this.
> >>> Thoughts on that?
> >>>
> >>>> From: mpercy@apache.org
> >>>> Date: Tue, 28 Jun 2016 00:27:45 -0700
> >>>> Subject: Re: [Discuss graph source/sink design proposal]
> >>>> To: dev@flume.apache.org
> >>>>
> >>>> Hi Saikat,
> >>>> Please see my thoughts inline. This is how I think about this stuff;
> >>> others
> >>>> may think about it differently.
> >>>>
> >>>> On Mon, Jun 27, 2016 at 8:45 PM, Saikat Kanjilal <sxk1969@hotmail.com
> >
> >>>> wrote:
> >>>>
> >>>>> Exactly right, I'm proposing we create a graph sink for flume while
> >>>>> keeping the flume core intact.
> >>>>
> >>>>
> >>>> As you are probably aware, sources and sinks don't have to be part of
> the
> >>>> main Apache Flume source tree to be used with Flume. The plugins.d
> >>>> mechanism described in [1] makes building and integrating separate
> >>> plugins
> >>>> into Flume an easy thing to do at deployment time.
> >>>>
> >>>> In another project I work on, Apache Kudu (incubating), we have a
> Flume
> >>>> Kudu sink committed in the main source tree [2]. We may at some point
> >>>> propose to move it into the Flume source tree, but for now (for
> testing
> >>> and
> >>>> API stability reasons) it's easier to keep it in the Kudu source tree.
> >>>>
> >>>> Likewise, you could implement a Flume Neo4J sink and post it up on
> GitHub
> >>>> (or maybe in the Neo4J tree?). Donating it to the Apache Flume project
> >>> once
> >>>> it's in decent shape may make sense at some point, especially if the
> >>>> dependencies are easy to share and integrate into the Flume project.
> >>>> However, I wouldn't say that it's a foregone conclusion that it really
> >>>> needs to be part of the Flume source tree. Assuming you need the sink,
> >>> and
> >>>> are going to implement it anyway, then maybe we can defer the
> discussion
> >>> of
> >>>> whether to include it in the Flume source tree until later. One of the
> >>>> things I try to keep in mind when integrating new plugin code is
> whether
> >>>> the project will be able to support the maintenance burden of the new
> >>> code.
> >>>>
> >>>> In reading from a graph db we need a mechanism to stream data from the
> >>>>> graph store into flume.
> >>>>
> >>>> Yes, I'd say it could potentially make sense to create a Flume Neo4J
> >>> source
> >>>> as well. I think the same logic as above would still apply.
> >>>>
> >>>> Regards,
> >>>> Mike
> >>>>
> >>>> [1]
> >>>
> https://flume.apache.org/FlumeUserGuide.html#installing-third-party-plugins
> >>>> [2]
> >>>
> https://github.com/apache/incubator-kudu/tree/master/java/kudu-flume-sink
> >
>
>

Re: [Discuss graph source/sink design proposal]

Posted by Saikat Kanjilal <sx...@hotmail.com>.

Mike et al,

To clarify again, I'm starting with the hbase sink and modifying it to match the graph use case.  This is probably why you saw the hbase stuff still left over.  In a nutshell, the design will look like the following:


flume->neo4j (sink workflow)

We batch events up from flume, use the neo4j bolt driver to convert the batch of events into Cypher statements, and then send the data in bulk into neo4j. One open question here is how many events go in a batch, and whether this should be dynamically configurable.
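
As a rough illustration of the batching step (a sketch only, not code from the repository): the helper below collects event headers and bodies into rows and hands them to a single parameterized Cypher statement. The class name CypherBatch, the Event label, the "body" property, and the batchSize setting are all assumptions made for this example. The $rows parameter syntax is also an assumption; older Neo4j releases use the {rows} form instead. Actually sending the statement through the bolt driver is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: gathers events into one parameterized Cypher statement. */
final class CypherBatch {
    // Assumed statement shape: one node per event. The "Event" label is
    // illustrative only; older Neo4j versions write the parameter as {rows}.
    static final String STATEMENT =
        "UNWIND $rows AS row CREATE (e:Event) SET e = row";

    private final int batchSize;
    private final List<Map<String, Object>> rows = new ArrayList<>();

    CypherBatch(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    /** Adds one event; returns true once the batch is full and should be sent. */
    boolean add(Map<String, String> headers, String body) {
        Map<String, Object> row = new HashMap<>(headers);
        row.put("body", body); // assumed property name for the event payload
        rows.add(row);
        return rows.size() >= batchSize;
    }

    /** Returns the parameter map for STATEMENT and empties the batch. */
    Map<String, Object> drain() {
        Map<String, Object> params = new HashMap<>();
        params.put("rows", new ArrayList<>(rows));
        rows.clear();
        return params;
    }
}
```

Making batchSize a constructor argument is one way to keep the batch size configurable, for example from the sink's configuration, which is the open question raised above.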


neo4j->flume (source workflow)

We add event listeners inside neo4j and then send data back into flume through these listeners. Here we'd need to be careful about sending every single event; a batching strategy might also make sense, but it takes away the concept of real-time updates.
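
One way to frame the batching versus real-time trade-off on the source side is a flush rule that fires on either a size limit or an age limit, whichever comes first. The sketch below is only an illustration under that assumption; FlushPolicy, maxEvents, and maxDelayMillis are hypothetical names, and it does not touch any neo4j listener API. The caller passes in the current time so the logic stays deterministic.

```java
/** Hypothetical flush rule: send when enough events are pending or the oldest is too old. */
final class FlushPolicy {
    private final int maxEvents;        // assumed size limit per batch
    private final long maxDelayMillis;  // assumed upper bound on added latency
    private int pending = 0;
    private long oldestMillis = -1;

    FlushPolicy(int maxEvents, long maxDelayMillis) {
        if (maxEvents <= 0 || maxDelayMillis < 0) {
            throw new IllegalArgumentException("limits must be non-negative, maxEvents positive");
        }
        this.maxEvents = maxEvents;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Records one change event seen at nowMillis; returns true if a flush is due. */
    boolean onEvent(long nowMillis) {
        if (pending == 0) {
            oldestMillis = nowMillis; // first event of a new batch starts the clock
        }
        pending++;
        return shouldFlush(nowMillis);
    }

    /** True when the batch is full or its oldest event has waited long enough. */
    boolean shouldFlush(long nowMillis) {
        if (pending == 0) {
            return false;
        }
        return pending >= maxEvents || nowMillis - oldestMillis >= maxDelayMillis;
    }

    /** Call after the pending events have been handed to flume. */
    void reset() {
        pending = 0;
        oldestMillis = -1;
    }
}
```

With maxEvents set to 1 every event is forwarded immediately, which recovers the real-time behaviour; larger values trade latency, bounded by maxDelayMillis, for fewer sends.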


More later as I make more progress. Also, your criteria for accepting this sink are no different from those for accepting contributions to any other open source project. I'd also like to know if there's interest from the community in connecting flume with neo4j, as that would generate more feedback on the design.

Here's a blurb on the new neo4j driver interface for Java and other languages:

https://neo4j.com/blog/neo4j-3-0-language-drivers/


Thanks




________________________________
From: Mike Percy <mp...@apache.org>
Sent: Friday, July 8, 2016 6:22 PM
To: dev@flume.apache.org
Subject: Re: [Discuss graph source/sink design proposal]

Hi Saikat, please see my responses inline.

On Thu, Jul 7, 2016 at 8:50 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:

> Ok moved the code to here:
> https://bitbucket.org/skanjila/flume-ng-graph-sink

It looks like mostly still HBaseSink code right now, just with a different
package name. I only looked at the Async one and that's what I found.

> Also I am exploring using the https://github.com/neo4j/neo4j-java-driver
> using the bolt protocol to connect to neo4j to stream events
>

I don't know anything about Neo4J personally. Unfortunately I don't have
time to really participate in development of this new sink using technology
I have no use for, myself. Maybe there are others on this list that have
the time and interest to help.

> Looking forward to getting feedback on this effort as y'all have time.
>

I apologize for not having the time to provide much guidance beyond the
capabilities of Flume itself.

In the future, as a committer on Flume, I would personally consider merging
Neo4J support into the Flume source tree if the following conditions were
met:

1. Strong feedback from others that this connector is desired by multiple
members of the community
2. An implementation that is well designed, tested, and production-grade
3. A likely long-term maintainer (maybe that is you?)

The reason I hesitate to add more integrations into the core is that if
this breaks, and someone is using it, we will have to fix it. If someone
asks a question on the mailing lists, we will have to attempt to answer it.

Regards,
Mike


From: Saikat Kanjilal <sx...@hotmail.com>
> Sent: Thursday, July 7, 2016 9:31 AM
> To: dev@flume.apache.org
> Subject: Re: [Discuss graph source/sink design proposal]
>
> Would it be ok to use bitbucket instead?  I have indeed extended
> AbstractSink to build the graph sink, I will depend on flume-ng-core on my
> pom as well.
>
> Thanks and feel free to respond on the cipher discussion as well as the
> other items I mentioned earlier.
>
>
> ________________________________
> From: Mike Percy <mp...@apache.org>
> Sent: Monday, July 4, 2016 12:03 PM
> To: dev@flume.apache.org
> Subject: Re: [Discuss graph source/sink design proposal]
>
> Hi Saikat,
> I recommend you use GitHub. Private branches in ASF repos are only
> available to committers.
>
> Regarding forking Flume, you should not need to do that. Just depend on
> flume-ng-core in your pom and extend AbstractSink. Maven will pull in your
> deps.
>
> I'm out of town for the next few days but I'll try to respond in more
> detail to your design notes when I'm back in town.
>
> Mike
>
> Sent from my iPhone
>
> > On Jul 4, 2016, at 6:59 AM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> >
> > Hari/Mike et al,
> >
> > I need a place to put interim checkins related to this work, is it
> possible to get write privileges into a private branch so that I can commit
> my code at intermediate junctures, I can also put it in bitbucket but would
> rather not have to create yet another place for the code to live if it'll
> eventually end up in the flume repo.
> >
> >
> > Thanks in advance
> >
> >
> > ________________________________
> > From: Saikat Kanjilal <sx...@hotmail.com>
> > Sent: Thursday, June 30, 2016 10:16 PM
> > To: dev@flume.apache.org
> > Subject: RE: [Discuss graph source/sink design proposal]
> >
> > So I've started the coding effort on this; here are some details:
> > 1) I've cloned the hbase sink for now and am refactoring all of that
> > code to work with neo4j as a start.
> > 2) I'm only focusing on creating a sink that will perform basic CRUD
> > streaming operations into neo4j.
> > 3) I've sent an email to the neo4j guys to figure out details around
> > building a streaming architecture with the neo4j kernel.
> > 4) In the meantime, how would you guys like to review the code? I've
> > cloned the flume repo and have created a branch called flume-2035 where
> > I will work. Should I put all the code in bitbucket and send out
> > periodic reviews? This is going to be a sizeable effort.
> > 5) How should we think about Cypher-related workflows as they relate to
> > the streaming data coming in? For the full flavor of Cypher, see
> > https://neo4j.com/developer/cypher-query-language/
> >
> > Would love to get some discussion going on 2-5.
> > Thanks
> >
> >> From: mpercy@apache.org
> >> Date: Wed, 29 Jun 2016 17:24:16 -0700
> >> Subject: Re: [Discuss graph source/sink design proposal]
> >> To: dev@flume.apache.org
> >>
> >> Hmm, maybe a different Kudu project? Not sure.
> >>
> >> Anyway, this type of "changelog" thing would require support in the DB
> for
> >> streaming its write-ahead log or something. For example, we don't
> support
> >> that in Apache Kudu (incubating) -- maybe someday.
> >>
> >> Regarding Flume, I usually think it's useful to distinguish between a
> >> source and a sink. They are typically written as separate classes and
> they
> >> represent different interfaces at the Flume Java API level.
> >>
> >> So, how would one write a streaming database source? That really
> depends on
> >> the database and the APIs it provides for that.
> >>
> >> Mike
> >>
> >> On Tue, Jun 28, 2016 at 8:30 AM, Saikat Kanjilal <sx...@hotmail.com>
> >> wrote:
> >>
> >>> :) I'm using Kudu at work at the moment to troubleshoot some Tomcat
> >>> issues,  regarding the where to keep the source code I would say for
> now
> >>> lets go with the plugin approach and revisit the "where does the code
> live"
> >>> conversation later.  One thing I do want to discuss is that the plugin
> will
> >>> act as a source or a sink depending on configuration, so if the plugin
> acts
> >>> as a source we need a mechanism (like a daemon in syslog) to stream
> changes
> >>> real time from a graphdb into flume, I was wondering if there are any
> past
> >>> approaches around this that I can follow, I may need to dig into the
> neo4j
> >>> kernel to see where we can inject something like this.
> >>> Thoughts on that?
> >>>
> >>>> From: mpercy@apache.org
> >>>> Date: Tue, 28 Jun 2016 00:27:45 -0700
> >>>> Subject: Re: [Discuss graph source/sink design proposal]
> >>>> To: dev@flume.apache.org
> >>>>
> >>>> Hi Saikat,
> >>>> Please see my thoughts inline. This is how I think about this stuff;
> >>> others
> >>>> may think about it differently.
> >>>>
> >>>> On Mon, Jun 27, 2016 at 8:45 PM, Saikat Kanjilal <sxk1969@hotmail.com
> >
> >>>> wrote:
> >>>>
> >>>>> Exactly right, I'm proposing we create a graph sink for flume while
> >>>>> keeping the flume core intact.
> >>>>
> >>>>
> >>>> As you are probably aware, sources and sinks don't have to be part of
> the
> >>>> main Apache Flume source tree to be used with Flume. The plugins.d
> >>>> mechanism described in [1] makes building and integrating separate
> >>> plugins
> >>>> into Flume an easy thing to do at deployment time.
> >>>>
> >>>> In another project I work on, Apache Kudu (incubating), we have a
> Flume
> >>>> Kudu sink committed in the main source tree [2]. We may at some point
> >>>> propose to move it into the Flume source tree, but for now (for
> testing
> >>> and
> >>>> API stability reasons) it's easier to keep it in the Kudu source tree.
> >>>>
> >>>> Likewise, you could implement a Flume Neo4J sink and post it up on
> GitHub
> >>>> (or maybe in the Neo4J tree?). Donating it to the Apache Flume project
> >>> once
> >>>> it's in decent shape may make sense at some point, especially if the
> >>>> dependencies are easy to share and integrate into the Flume project.
> >>>> However, I wouldn't say that it's a foregone conclusion that it really
> >>>> needs to be part of the Flume source tree. Assuming you need the sink,
> >>> and
> >>>> are going to implement it anyway, then maybe we can defer the
> discussion
> >>> of
> >>>> whether to include it in the Flume source tree until later. One of the
> >>>> things I try to keep in mind when integrating new plugin code is
> whether
> >>>> the project will be able to support the maintenance burden of the new
> >>> code.
> >>>>
> >>>> In reading from a graph db we need a mechanism to stream data from the
> >>>>> graph store into flume.
> >>>>
> >>>> Yes, I'd say it could potentially make sense to create a Flume Neo4J
> >>> source
> >>>> as well. I think the same logic as above would still apply.
> >>>>
> >>>> Regards,
> >>>> Mike
> >>>>
> >>>> [1]
> >>>
> https://flume.apache.org/FlumeUserGuide.html#installing-third-party-plugins
> >>>> [2]
> >>>
> https://github.com/apache/incubator-kudu/tree/master/java/kudu-flume-sink
> >
>
>

Re: [Discuss graph source/sink design proposal]

Posted by Mike Percy <mp...@apache.org>.

Hi Saikat, please see my responses inline.

On Thu, Jul 7, 2016 at 8:50 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:

> Ok moved the code to here:
> https://bitbucket.org/skanjila/flume-ng-graph-sink


It looks like mostly still HBaseSink code right now, just with a different
package name. I only looked at the Async one and that's what I found.

> Also I am exploring using the https://github.com/neo4j/neo4j-java-driver
> using the bolt protocol to connect to neo4j to stream events
>

I don't know anything about Neo4J personally. Unfortunately I don't have
time to really participate in development of this new sink using technology
I have no use for, myself. Maybe there are others on this list that have
the time and interest to help.

> Looking forward to getting feedback on this effort as y'all have time.
>

I apologize for not having the time to provide much guidance beyond the
capabilities of Flume itself.

In the future, as a committer on Flume, I would personally consider merging
Neo4J support into the Flume source tree if the following conditions were
met:

1. Strong feedback from others that this connector is desired by multiple
members of the community
2. An implementation that is well designed, tested, and production-grade
3. A likely long-term maintainer (maybe that is you?)

The reason I hesitate to add more integrations into the core is that if
this breaks, and someone is using it, we will have to fix it. If someone
asks a question on the mailing lists, we will have to attempt to answer it.

Regards,
Mike


> From: Saikat Kanjilal <sx...@hotmail.com>
> Sent: Thursday, July 7, 2016 9:31 AM
> To: dev@flume.apache.org
> Subject: Re: [Discuss graph source/sink design proposal]
>
> Would it be ok to use bitbucket instead? I have indeed extended
> AbstractSink to build the graph sink, and I will depend on flume-ng-core
> in my pom as well.
>
> Thanks, and feel free to respond on the Cypher discussion as well as the
> other items I mentioned earlier.
>
>
> ________________________________
> From: Mike Percy <mp...@apache.org>
> Sent: Monday, July 4, 2016 12:03 PM
> To: dev@flume.apache.org
> Subject: Re: [Discuss graph source/sink design proposal]
>
> Hi Saikat,
> I recommend you use GitHub. Private branches in ASF repos are only
> available to committers.
>
> Regarding forking Flume, you should not need to do that. Just depend on
> flume-ng-core in your pom and extend AbstractSink. Maven will pull in your
> deps.
>
> I'm out of town for the next few days but I'll try to respond in more
> detail to your design notes when I'm back in town.
>
> Mike
>
> Sent from my iPhone
>
> > On Jul 4, 2016, at 6:59 AM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> >
> > Hari/Mike et al,
> >
> > I need a place to put interim checkins related to this work, is it
> possible to get write privileges into a private branch so that I can commit
> my code at intermediate junctures, I can also put it in bitbucket but would
> rather not have to create yet another place for the code to live if it'll
> eventually end up in the flume repo.
> >
> >
> > Thanks in advance
> >
> >
> > ________________________________
> > From: Saikat Kanjilal <sx...@hotmail.com>
> > Sent: Thursday, June 30, 2016 10:16 PM
> > To: dev@flume.apache.org
> > Subject: RE: [Discuss graph source/sink design proposal]
> >
> > So I've started the coding efforts on this, here are some details:
> > 1) I've cloned the hbase sink for now and am refactoring all of that
> > code to work with neo4j as a start.
> > 2) I'm only focusing on creating a sink that will perform basic CRUD
> > streaming operations into neo4j.
> > 3) I've sent an email to the neo4j guys to figure out details around
> > building a streaming architecture with the neo4j kernel.
> > 4) In the meantime, how would you guys like to review the code? I've
> > cloned the flume repo and have created a branch called flume-2035 where
> > I will work. Should I put all the code in bitbucket and send out
> > periodic reviews? This is going to be a sizeable effort.
> > 5) How should we think about Cypher-related workflows as they relate to
> > the streaming data coming in? To see the full flavor of Cypher, go here:
> > https://neo4j.com/developer/cypher-query-language/
> >
> > Would love to get some discussion going on 2-5.
> > Thanks
> >
> >> From: mpercy@apache.org
> >> Date: Wed, 29 Jun 2016 17:24:16 -0700
> >> Subject: Re: [Discuss graph source/sink design proposal]
> >> To: dev@flume.apache.org
> >>
> >> Hmm, maybe a different Kudu project? Not sure.
> >>
> >> Anyway, this type of "changelog" thing would require support in the DB
> for
> >> streaming its write-ahead log or something. For example, we don't
> support
> >> that in Apache Kudu (incubating) -- maybe someday.
> >>
> >> Regarding Flume, I usually think it's useful to distinguish between a
> >> source and a sink. They are typically written as separate classes and
> they
> >> represent different interfaces at the Flume Java API level.
> >>
> >> So, how would one write a streaming database source? That really
> depends on
> >> the database and the APIs it provides for that.
> >>
> >> Mike
> >>
> >> On Tue, Jun 28, 2016 at 8:30 AM, Saikat Kanjilal <sx...@hotmail.com>
> >> wrote:
> >>
> >>> :) I'm using Kudu at work at the moment to troubleshoot some Tomcat
> >>> issues,  regarding the where to keep the source code I would say for
> now
> >>> lets go with the plugin approach and revisit the "where does the code
> live"
> >>> conversation later.  One thing I do want to discuss is that the plugin
> will
> >>> act as a source or a sink depending on configuration, so if the plugin
> acts
> >>> as a source we need a mechanism (like a daemon in syslog) to stream
> changes
> >>> real time from a graphdb into flume, I was wondering if there are any
> past
> >>> approaches around this that I can follow, I may need to dig into the
> neo4j
> >>> kernel to see where we can inject something like this.
> >>> Thoughts on that?
> >>>
> >>>> From: mpercy@apache.org
> >>>> Date: Tue, 28 Jun 2016 00:27:45 -0700
> >>>> Subject: Re: [Discuss graph source/sink design proposal]
> >>>> To: dev@flume.apache.org
> >>>>
> >>>> Hi Saikat,
> >>>> Please see my thoughts inline. This is how I think about this stuff;
> >>> others
> >>>> may think about it differently.
> >>>>
> >>>> On Mon, Jun 27, 2016 at 8:45 PM, Saikat Kanjilal <sxk1969@hotmail.com
> >
> >>>> wrote:
> >>>>
> >>>>> Exactly right, I'm proposing we create a graph sink for flume while
> >>>>> keeping the flume core intact.
> >>>>
> >>>>
> >>>> As you are probably aware, sources and sinks don't have to be part of
> the
> >>>> main Apache Flume source tree to be used with Flume. The plugins.d
> >>>> mechanism described in [1] makes building and integrating separate
> >>> plugins
> >>>> into Flume an easy thing to do at deployment time.
> >>>>
> >>>> In another project I work on, Apache Kudu (incubating), we have a
> Flume
> >>>> Kudu sink committed in the main source tree [2]. We may at some point
> >>>> propose to move it into the Flume source tree, but for now (for
> testing
> >>> and
> >>>> API stability reasons) it's easier to keep it in the Kudu source tree.
> >>>>
> >>>> Likewise, you could implement a Flume Neo4J sink and post it up on
> GitHub
> >>>> (or maybe in the Neo4J tree?). Donating it to the Apache Flume project
> >>> once
> >>>> it's in decent shape may make sense at some point, especially if the
> >>>> dependencies are easy to share and integrate into the Flume project.
> >>>> However, I wouldn't say that it's a foregone conclusion that it really
> >>>> needs to be part of the Flume source tree. Assuming you need the sink,
> >>> and
> >>>> are going to implement it anyway, then maybe we can defer the
> discussion
> >>> of
> >>>> whether to include it in the Flume source tree until later. One of the
> >>>> things I try to keep in mind when integrating new plugin code is
> whether
> >>>> the project will be able to support the maintenance burden of the new
> >>> code.
> >>>>
> >>>> In reading from a graph db we need a mechanism to stream data from the
> >>>>> graph store into flume.
> >>>>
> >>>> Yes, I'd say it could potentially make sense to create a Flume Neo4J
> >>> source
> >>>> as well. I think the same logic as above would still apply.
> >>>>
> >>>> Regards,
> >>>> Mike
> >>>>
> >>>> [1]
> >>>
> https://flume.apache.org/FlumeUserGuide.html#installing-third-party-plugins
> >>>> [2]
> >>>
> https://github.com/apache/incubator-kudu/tree/master/java/kudu-flume-sink
> >
>
>

Re: [Discuss graph source/sink design proposal]

Posted by Saikat Kanjilal <sx...@hotmail.com>.

Ok moved the code to here:


https://bitbucket.org/skanjila/flume-ng-graph-sink


Also I am exploring using the new neo4j java driver using the bolt protocol to connect to neo4j to stream events:

https://github.com/neo4j/neo4j-java-driver
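
As a rough illustration of what "streaming events" over that driver might involve, here is a minimal sketch of turning one event's headers into a parameterized Cypher upsert. It uses only the JDK, so it makes no claims about the Flume or driver APIs. The class name CypherEventMapper, the idea of keying a node on one header, and the label and header names are all hypothetical and not taken from the repository above. It also assumes the `$name` parameter syntax; older Neo4j releases wrote parameters as `{name}`.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: maps event headers to a parameterized Cypher MERGE. */
final class CypherEventMapper {

    private CypherEventMapper() {}

    /**
     * Builds a MERGE statement that upserts one node keyed on keyProperty.
     * Only validated identifiers are spliced into the text; all values
     * travel separately as parameters.
     */
    static String mergeStatement(String label, String keyProperty) {
        requireIdentifier(label);
        requireIdentifier(keyProperty);
        return "MERGE (n:" + label + " {" + keyProperty + ": $key}) SET n += $props";
    }

    /** Splits the headers into the key value and the remaining properties. */
    static Map<String, Object> parameters(Map<String, String> headers, String keyProperty) {
        String key = headers.get(keyProperty);
        if (key == null) {
            throw new IllegalArgumentException("missing header: " + keyProperty);
        }
        Map<String, Object> props = new TreeMap<>(headers);
        props.remove(keyProperty);
        Map<String, Object> params = new TreeMap<>();
        params.put("key", key);
        params.put("props", props);
        return params;
    }

    /** Rejects anything that is not a plain identifier, to avoid injection. */
    private static void requireIdentifier(String s) {
        if (s == null || !s.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("not a plain identifier: " + s);
        }
    }
}
```

The statement text and the parameter map could then be handed to whatever statement-execution call the driver provides. Keeping values out of the statement text is the main design choice here, since event data arriving through Flume should not be trusted as query text.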



Looking forward to getting feedback on this effort as y'all have time.
Thanks

________________________________
From: Saikat Kanjilal <sx...@hotmail.com>
Sent: Thursday, July 7, 2016 9:31 AM
To: dev@flume.apache.org
Subject: Re: [Discuss graph source/sink design proposal]

Would it be ok to use bitbucket instead? I have indeed extended AbstractSink to build the graph sink, and I will depend on flume-ng-core in my pom as well.

Thanks, and feel free to respond on the Cypher discussion as well as the other items I mentioned earlier.


________________________________
From: Mike Percy <mp...@apache.org>
Sent: Monday, July 4, 2016 12:03 PM
To: dev@flume.apache.org
Subject: Re: [Discuss graph source/sink design proposal]

Hi Saikat,
I recommend you use GitHub. Private branches in ASF repos are only available to committers.

Regarding forking Flume, you should not need to do that. Just depend on flume-ng-core in your pom and extend AbstractSink. Maven will pull in your deps.

I'm out of town for the next few days but I'll try to respond in more detail to your design notes when I'm back in town.

Mike

Sent from my iPhone

> On Jul 4, 2016, at 6:59 AM, Saikat Kanjilal <sx...@hotmail.com> wrote:
>
> Hari/Mike et al,
>
> I need a place to put interim checkins related to this work, is it possible to get write privileges into a private branch so that I can commit my code at intermediate junctures, I can also put it in bitbucket but would rather not have to create yet another place for the code to live if it'll eventually end up in the flume repo.
>
>
> Thanks in advance
>
>
> ________________________________
> From: Saikat Kanjilal <sx...@hotmail.com>
> Sent: Thursday, June 30, 2016 10:16 PM
> To: dev@flume.apache.org
> Subject: RE: [Discuss graph source/sink design proposal]
>
> So I've started the coding efforts on this, here are some details:
> 1) I've cloned the hbase sink for now and am refactoring all of that code to work with neo4j as a start.
> 2) I'm only focusing on creating a sink that will perform basic CRUD streaming operations into neo4j.
> 3) I've sent an email to the neo4j guys to figure out details around building a streaming architecture with the neo4j kernel.
> 4) In the meantime, how would you guys like to review the code? I've cloned the flume repo and have created a branch called flume-2035 where I will work. Should I put all the code in bitbucket and send out periodic reviews? This is going to be a sizeable effort.
> 5) How should we think about Cypher-related workflows as they relate to the streaming data coming in? To see the full flavor of Cypher, go here: https://neo4j.com/developer/cypher-query-language/
>
> Would love to get some discussion going on 2-5.
> Thanks
>
>> From: mpercy@apache.org
>> Date: Wed, 29 Jun 2016 17:24:16 -0700
>> Subject: Re: [Discuss graph source/sink design proposal]
>> To: dev@flume.apache.org
>>
>> Hmm, maybe a different Kudu project? Not sure.
>>
>> Anyway, this type of "changelog" thing would require support in the DB for
>> streaming its write-ahead log or something. For example, we don't support
>> that in Apache Kudu (incubating) -- maybe someday.
>>
>> Regarding Flume, I usually think it's useful to distinguish between a
>> source and a sink. They are typically written as separate classes and they
>> represent different interfaces at the Flume Java API level.
>>
>> So, how would one write a streaming database source? That really depends on
>> the database and the APIs it provides for that.
>>
>> Mike
>>
>> On Tue, Jun 28, 2016 at 8:30 AM, Saikat Kanjilal <sx...@hotmail.com>
>> wrote:
>>
>>> :) I'm using Kudu at work at the moment to troubleshoot some Tomcat
>>> issues,  regarding the where to keep the source code I would say for now
>>> lets go with the plugin approach and revisit the "where does the code live"
>>> conversation later.  One thing I do want to discuss is that the plugin will
>>> act as a source or a sink depending on configuration, so if the plugin acts
>>> as a source we need a mechanism (like a daemon in syslog) to stream changes
>>> real time from a graphdb into flume, I was wondering if there are any past
>>> approaches around this that I can follow, I may need to dig into the neo4j
>>> kernel to see where we can inject something like this.
>>> Thoughts on that?
>>>
>>>> From: mpercy@apache.org
>>>> Date: Tue, 28 Jun 2016 00:27:45 -0700
>>>> Subject: Re: [Discuss graph source/sink design proposal]
>>>> To: dev@flume.apache.org
>>>>
>>>> Hi Saikat,
>>>> Please see my thoughts inline. This is how I think about this stuff;
>>> others
>>>> may think about it differently.
>>>>
>>>> On Mon, Jun 27, 2016 at 8:45 PM, Saikat Kanjilal <sx...@hotmail.com>
>>>> wrote:
>>>>
>>>>> Exactly right, I'm proposing we create a graph sink for flume while
>>>>> keeping the flume core intact.
>>>>
>>>>
>>>> As you are probably aware, sources and sinks don't have to be part of the
>>>> main Apache Flume source tree to be used with Flume. The plugins.d
>>>> mechanism described in [1] makes building and integrating separate
>>> plugins
>>>> into Flume an easy thing to do at deployment time.
>>>>
>>>> In another project I work on, Apache Kudu (incubating), we have a Flume
>>>> Kudu sink committed in the main source tree [2]. We may at some point
>>>> propose to move it into the Flume source tree, but for now (for testing
>>> and
>>>> API stability reasons) it's easier to keep it in the Kudu source tree.
>>>>
>>>> Likewise, you could implement a Flume Neo4J sink and post it up on GitHub
>>>> (or maybe in the Neo4J tree?). Donating it to the Apache Flume project
>>> once
>>>> it's in decent shape may make sense at some point, especially if the
>>>> dependencies are easy to share and integrate into the Flume project.
>>>> However, I wouldn't say that it's a foregone conclusion that it really
>>>> needs to be part of the Flume source tree. Assuming you need the sink,
>>> and
>>>> are going to implement it anyway, then maybe we can defer the discussion
>>> of
>>>> whether to include it in the Flume source tree until later. One of the
>>>> things I try to keep in mind when integrating new plugin code is whether
>>>> the project will be able to support the maintenance burden of the new
>>> code.
>>>>
>>>> In reading from a graph db we need a mechanism to stream data from the
>>>>> graph store into flume.
>>>>
>>>> Yes, I'd say it could potentially make sense to create a Flume Neo4J
>>> source
>>>> as well. I think the same logic as above would still apply.
>>>>
>>>> Regards,
>>>> Mike
>>>>
>>>> [1]
>>> https://flume.apache.org/FlumeUserGuide.html#installing-third-party-plugins
>>>> [2]
>>> https://github.com/apache/incubator-kudu/tree/master/java/kudu-flume-sink
>

Re: [Discuss graph source/sink design proposal]

Posted by Saikat Kanjilal <sx...@hotmail.com>.

Would it be ok to use bitbucket instead? I have indeed extended AbstractSink to build the graph sink, and I will depend on flume-ng-core in my pom as well.

Thanks, and feel free to respond on the Cypher discussion as well as the other items I mentioned earlier.


________________________________
From: Mike Percy <mp...@apache.org>
Sent: Monday, July 4, 2016 12:03 PM
To: dev@flume.apache.org
Subject: Re: [Discuss graph source/sink design proposal]

Hi Saikat,
I recommend you use GitHub. Private branches in ASF repos are only available to committers.

Regarding forking Flume, you should not need to do that. Just depend on flume-ng-core in your pom and extend AbstractSink. Maven will pull in your deps.

I'm out of town for the next few days but I'll try to respond in more detail to your design notes when I'm back in town.

Mike

Sent from my iPhone

> On Jul 4, 2016, at 6:59 AM, Saikat Kanjilal <sx...@hotmail.com> wrote:
>
> Hari/Mike et al,
>
> I need a place to put interim checkins related to this work, is it possible to get write privileges into a private branch so that I can commit my code at intermediate junctures, I can also put it in bitbucket but would rather not have to create yet another place for the code to live if it'll eventually end up in the flume repo.
>
>
> Thanks in advance
>
>
> ________________________________
> From: Saikat Kanjilal <sx...@hotmail.com>
> Sent: Thursday, June 30, 2016 10:16 PM
> To: dev@flume.apache.org
> Subject: RE: [Discuss graph source/sink design proposal]
>
> So I've started the coding efforts on this, here are some details:
> 1) I've cloned the hbase sink for now and am refactoring all of that code to work with neo4j as a start.
> 2) I'm only focusing on creating a sink that will perform basic CRUD streaming operations into neo4j.
> 3) I've sent an email to the neo4j guys to figure out details around building a streaming architecture with the neo4j kernel.
> 4) In the meantime, how would you guys like to review the code? I've cloned the flume repo and have created a branch called flume-2035 where I will work. Should I put all the code in bitbucket and send out periodic reviews? This is going to be a sizeable effort.
> 5) How should we think about Cypher-related workflows as they relate to the streaming data coming in? To see the full flavor of Cypher, go here: https://neo4j.com/developer/cypher-query-language/
>
> Would love to get some discussion going on 2-5.
> Thanks
>
>> From: mpercy@apache.org
>> Date: Wed, 29 Jun 2016 17:24:16 -0700
>> Subject: Re: [Discuss graph source/sink design proposal]
>> To: dev@flume.apache.org
>>
>> Hmm, maybe a different Kudu project? Not sure.
>>
>> Anyway, this type of "changelog" thing would require support in the DB for
>> streaming its write-ahead log or something. For example, we don't support
>> that in Apache Kudu (incubating) -- maybe someday.
>>
>> Regarding Flume, I usually think it's useful to distinguish between a
>> source and a sink. They are typically written as separate classes and they
>> represent different interfaces at the Flume Java API level.
>>
>> So, how would one write a streaming database source? That really depends on
>> the database and the APIs it provides for that.
>>
>> Mike
>>
>> On Tue, Jun 28, 2016 at 8:30 AM, Saikat Kanjilal <sx...@hotmail.com>
>> wrote:
>>
>>> :) I'm using Kudu at work at the moment to troubleshoot some Tomcat
>>> issues,  regarding the where to keep the source code I would say for now
>>> lets go with the plugin approach and revisit the "where does the code live"
>>> conversation later.  One thing I do want to discuss is that the plugin will
>>> act as a source or a sink depending on configuration, so if the plugin acts
>>> as a source we need a mechanism (like a daemon in syslog) to stream changes
>>> real time from a graphdb into flume, I was wondering if there are any past
>>> approaches around this that I can follow, I may need to dig into the neo4j
>>> kernel to see where we can inject something like this.
>>> Thoughts on that?
>>>
>>>> From: mpercy@apache.org
>>>> Date: Tue, 28 Jun 2016 00:27:45 -0700
>>>> Subject: Re: [Discuss graph source/sink design proposal]
>>>> To: dev@flume.apache.org
>>>>
>>>> Hi Saikat,
>>>> Please see my thoughts inline. This is how I think about this stuff;
>>> others
>>>> may think about it differently.
>>>>
>>>> On Mon, Jun 27, 2016 at 8:45 PM, Saikat Kanjilal <sx...@hotmail.com>
>>>> wrote:
>>>>
>>>>> Exactly right, I'm proposing we create a graph sink for flume while
>>>>> keeping the flume core intact.
>>>>
>>>>
>>>> As you are probably aware, sources and sinks don't have to be part of the
>>>> main Apache Flume source tree to be used with Flume. The plugins.d
>>>> mechanism described in [1] makes building and integrating separate
>>> plugins
>>>> into Flume an easy thing to do at deployment time.
>>>>
>>>> In another project I work on, Apache Kudu (incubating), we have a Flume
>>>> Kudu sink committed in the main source tree [2]. We may at some point
>>>> propose to move it into the Flume source tree, but for now (for testing
>>> and
>>>> API stability reasons) it's easier to keep it in the Kudu source tree.
>>>>
>>>> Likewise, you could implement a Flume Neo4J sink and post it up on GitHub
>>>> (or maybe in the Neo4J tree?). Donating it to the Apache Flume project
>>> once
>>>> it's in decent shape may make sense at some point, especially if the
>>>> dependencies are easy to share and integrate into the Flume project.
>>>> However, I wouldn't say that it's a foregone conclusion that it really
>>>> needs to be part of the Flume source tree. Assuming you need the sink,
>>> and
>>>> are going to implement it anyway, then maybe we can defer the discussion
>>> of
>>>> whether to include it in the Flume source tree until later. One of the
>>>> things I try to keep in mind when integrating new plugin code is whether
>>>> the project will be able to support the maintenance burden of the new
>>> code.
>>>>
>>>> In reading from a graph db we need a mechanism to stream data from the
>>>>> graph store into flume.
>>>>
>>>> Yes, I'd say it could potentially make sense to create a Flume Neo4J
>>> source
>>>> as well. I think the same logic as above would still apply.
>>>>
>>>> Regards,
>>>> Mike
>>>>
>>>> [1]
>>> https://flume.apache.org/FlumeUserGuide.html#installing-third-party-plugins
>>>> [2]
>>> https://github.com/apache/incubator-kudu/tree/master/java/kudu-flume-sink
>

Re: [Discuss graph source/sink design proposal]

Posted by Mike Percy <mp...@apache.org>.

Hi Saikat,
I recommend you use GitHub. Private branches in ASF repos are only available to committers.

Regarding forking Flume, you should not need to do that. Just depend on flume-ng-core in your pom and extend AbstractSink. Maven will pull in your deps.

I'm out of town for the next few days but I'll try to respond in more detail to your design notes when I'm back in town.

Mike

Sent from my iPhone

> On Jul 4, 2016, at 6:59 AM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> 
> Hari/Mike et al,
> 
> I need a place to put interim checkins related to this work, is it possible to get write privileges into a private branch so that I can commit my code at intermediate junctures, I can also put it in bitbucket but would rather not have to create yet another place for the code to live if it'll eventually end up in the flume repo.
> 
> 
> Thanks in advance
> 
> 
> ________________________________
> From: Saikat Kanjilal <sx...@hotmail.com>
> Sent: Thursday, June 30, 2016 10:16 PM
> To: dev@flume.apache.org
> Subject: RE: [Discuss graph source/sink design proposal]
> 
> So I've started the coding efforts on this, here are some details:
> 1) I've cloned the hbase sink for now and am refactoring all of that code to work with neo4j as a start.
> 2) I'm only focusing on creating a sink that will perform basic CRUD streaming operations into neo4j.
> 3) I've sent an email to the neo4j guys to figure out details around building a streaming architecture with the neo4j kernel.
> 4) In the meantime, how would you guys like to review the code? I've cloned the flume repo and have created a branch called flume-2035 where I will work. Should I put all the code in bitbucket and send out periodic reviews? This is going to be a sizeable effort.
> 5) How should we think about Cypher-related workflows as they relate to the streaming data coming in? To see the full flavor of Cypher, go here: https://neo4j.com/developer/cypher-query-language/
> 
> Would love to get some discussion going on 2-5.
> Thanks
> 
>> From: mpercy@apache.org
>> Date: Wed, 29 Jun 2016 17:24:16 -0700
>> Subject: Re: [Discuss graph source/sink design proposal]
>> To: dev@flume.apache.org
>> 
>> Hmm, maybe a different Kudu project? Not sure.
>> 
>> Anyway, this type of "changelog" thing would require support in the DB for
>> streaming its write-ahead log or something. For example, we don't support
>> that in Apache Kudu (incubating) -- maybe someday.
>> 
>> Regarding Flume, I usually think it's useful to distinguish between a
>> source and a sink. They are typically written as separate classes and they
>> represent different interfaces at the Flume Java API level.
>> 
>> So, how would one write a streaming database source? That really depends on
>> the database and the APIs it provides for that.
>> 
>> Mike
>> 
>> On Tue, Jun 28, 2016 at 8:30 AM, Saikat Kanjilal <sx...@hotmail.com>
>> wrote:
>> 
>>> :) I'm using Kudu at work at the moment to troubleshoot some Tomcat
>>> issues,  regarding the where to keep the source code I would say for now
>>> lets go with the plugin approach and revisit the "where does the code live"
>>> conversation later.  One thing I do want to discuss is that the plugin will
>>> act as a source or a sink depending on configuration, so if the plugin acts
>>> as a source we need a mechanism (like a daemon in syslog) to stream changes
>>> real time from a graphdb into flume, I was wondering if there are any past
>>> approaches around this that I can follow, I may need to dig into the neo4j
>>> kernel to see where we can inject something like this.
>>> Thoughts on that?
>>> 
>>>> From: mpercy@apache.org
>>>> Date: Tue, 28 Jun 2016 00:27:45 -0700
>>>> Subject: Re: [Discuss graph source/sink design proposal]
>>>> To: dev@flume.apache.org
>>>> 
>>>> Hi Saikat,
>>>> Please see my thoughts inline. This is how I think about this stuff;
>>> others
>>>> may think about it differently.
>>>> 
>>>> On Mon, Jun 27, 2016 at 8:45 PM, Saikat Kanjilal <sx...@hotmail.com>
>>>> wrote:
>>>> 
>>>>> Exactly right, I'm proposing we create a graph sink for flume while
>>>>> keeping the flume core intact.
>>>> 
>>>> 
>>>> As you are probably aware, sources and sinks don't have to be part of the
>>>> main Apache Flume source tree to be used with Flume. The plugins.d
>>>> mechanism described in [1] makes building and integrating separate
>>> plugins
>>>> into Flume an easy thing to do at deployment time.
>>>> 
>>>> In another project I work on, Apache Kudu (incubating), we have a Flume
>>>> Kudu sink committed in the main source tree [2]. We may at some point
>>>> propose to move it into the Flume source tree, but for now (for testing
>>> and
>>>> API stability reasons) it's easier to keep it in the Kudu source tree.
>>>> 
>>>> Likewise, you could implement a Flume Neo4J sink and post it up on GitHub
>>>> (or maybe in the Neo4J tree?). Donating it to the Apache Flume project
>>> once
>>>> it's in decent shape may make sense at some point, especially if the
>>>> dependencies are easy to share and integrate into the Flume project.
>>>> However, I wouldn't say that it's a foregone conclusion that it really
>>>> needs to be part of the Flume source tree. Assuming you need the sink,
>>> and
>>>> are going to implement it anyway, then maybe we can defer the discussion
>>> of
>>>> whether to include it in the Flume source tree until later. One of the
>>>> things I try to keep in mind when integrating new plugin code is whether
>>>> the project will be able to support the maintenance burden of the new
>>> code.
>>>> 
>>>> In reading from a graph db we need a mechanism to stream data from the
>>>>> graph store into flume.
>>>> 
>>>> Yes, I'd say it could potentially make sense to create a Flume Neo4J
>>> source
>>>> as well. I think the same logic as above would still apply.
>>>> 
>>>> Regards,
>>>> Mike
>>>> 
>>>> [1]
>>> https://flume.apache.org/FlumeUserGuide.html#installing-third-party-plugins
>>>> [2]
>>> https://github.com/apache/incubator-kudu/tree/master/java/kudu-flume-sink
>

Re: [Discuss graph source/sink design proposal]

Posted by Saikat Kanjilal <sx...@hotmail.com>.

Hari/Mike et al,

I need a place to put interim check-ins related to this work. Is it possible to get write privileges on a private branch so that I can commit my code at intermediate junctures? I can also put it in bitbucket, but I would rather not create yet another place for the code to live if it will eventually end up in the flume repo.


Thanks in advance


________________________________
From: Saikat Kanjilal <sx...@hotmail.com>
Sent: Thursday, June 30, 2016 10:16 PM
To: dev@flume.apache.org
Subject: RE: [Discuss graph source/sink design proposal]

So I've started the coding efforts on this, here are some details:
1) I've cloned the hbase sink for now and am refactoring all of that code to work with neo4j as a start.
2) I'm only focusing on creating a sink that will perform basic CRUD streaming operations into neo4j.
3) I've sent an email to the neo4j guys to figure out details around building a streaming architecture with the neo4j kernel.
4) In the meantime, how would you guys like to review the code? I've cloned the flume repo and have created a branch called flume-2035 where I will work. Should I put all the code in bitbucket and send out periodic reviews? This is going to be a sizeable effort.
5) How should we think about Cypher-related workflows as they relate to the streaming data coming in? To see the full flavor of Cypher, go here: https://neo4j.com/developer/cypher-query-language/

Would love to get some discussion going on 2-5.
Thanks

> From: mpercy@apache.org
> Date: Wed, 29 Jun 2016 17:24:16 -0700
> Subject: Re: [Discuss graph source/sink design proposal]
> To: dev@flume.apache.org
>
> Hmm, maybe a different Kudu project? Not sure.
>
> Anyway, this type of "changelog" thing would require support in the DB for
> streaming its write-ahead log or something. For example, we don't support
> that in Apache Kudu (incubating) -- maybe someday.
>
> Regarding Flume, I usually think it's useful to distinguish between a
> source and a sink. They are typically written as separate classes and they
> represent different interfaces at the Flume Java API level.
>
> So, how would one write a streaming database source? That really depends on
> the database and the APIs it provides for that.
>
> Mike
>
> On Tue, Jun 28, 2016 at 8:30 AM, Saikat Kanjilal <sx...@hotmail.com>
> wrote:
>
> > :) I'm using Kudu at work at the moment to troubleshoot some Tomcat
> > issues. Regarding where to keep the source code, I would say for now
> > let's go with the plugin approach and revisit the "where does the code live"
> > conversation later. One thing I do want to discuss is that the plugin will
> > act as a source or a sink depending on configuration. If the plugin acts
> > as a source, we need a mechanism (like a daemon in syslog) to stream changes
> > in real time from a graph DB into Flume. I was wondering if there are any past
> > approaches to this that I can follow. I may need to dig into the Neo4j
> > kernel to see where we can inject something like this.
> > Thoughts on that?
> >
> > > From: mpercy@apache.org
> > > Date: Tue, 28 Jun 2016 00:27:45 -0700
> > > Subject: Re: [Discuss graph source/sink design proposal]
> > > To: dev@flume.apache.org
> > >
> > > Hi Saikat,
> > > Please see my thoughts inline. This is how I think about this stuff; others
> > > may think about it differently.
> > >
> > > On Mon, Jun 27, 2016 at 8:45 PM, Saikat Kanjilal <sx...@hotmail.com>
> > > wrote:
> > >
> > > > Exactly right, I'm proposing we create a graph sink for flume while
> > > > keeping the flume core intact.
> > >
> > >
> > > As you are probably aware, sources and sinks don't have to be part of the
> > > main Apache Flume source tree to be used with Flume. The plugins.d
> > > mechanism described in [1] makes building and integrating separate plugins
> > > into Flume an easy thing to do at deployment time.
> > >
> > > In another project I work on, Apache Kudu (incubating), we have a Flume
> > > Kudu sink committed in the main source tree [2]. We may at some point
> > > propose to move it into the Flume source tree, but for now (for testing and
> > > API stability reasons) it's easier to keep it in the Kudu source tree.
> > >
> > > Likewise, you could implement a Flume Neo4J sink and post it up on GitHub
> > > (or maybe in the Neo4J tree?). Donating it to the Apache Flume project once
> > > it's in decent shape may make sense at some point, especially if the
> > > dependencies are easy to share and integrate into the Flume project.
> > > However, I wouldn't say that it's a foregone conclusion that it really
> > > needs to be part of the Flume source tree. Assuming you need the sink, and
> > > are going to implement it anyway, then maybe we can defer the discussion of
> > > whether to include it in the Flume source tree until later. One of the
> > > things I try to keep in mind when integrating new plugin code is whether
> > > the project will be able to support the maintenance burden of the new code.
> > >
> > > > In reading from a graph db we need a mechanism to stream data from the
> > > > graph store into flume.
> > > >
> > >
> > > Yes, I'd say it could potentially make sense to create a Flume Neo4J source
> > > as well. I think the same logic as above would still apply.
> > >
> > > Regards,
> > > Mike
> > >
> > > [1] https://flume.apache.org/FlumeUserGuide.html#installing-third-party-plugins
> > > [2] https://github.com/apache/incubator-kudu/tree/master/java/kudu-flume-sink
> >
> >