Posted to users@nifi.apache.org by Shashi Vishwakarma <sh...@gmail.com> on 2017/06/04 02:43:33 UTC

Spark Streaming with Nifi

Hi

I am looking for a way to make use of Spark Streaming with NiFi. I have seen
a couple of posts where a SiteToSite TCP connection is used for the Spark
Streaming application, but I think it would be good if I could launch Spark
Streaming from a custom NiFi processor.

PublishKafka would publish messages into Kafka, and then a NiFi Spark
Streaming processor would read from the Kafka topic.
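
The Spark side of that hand-off would be roughly the sketch below. The broker
list, topic name, batch interval, and the use of the spark-streaming-kafka-0-10
integration are just my assumptions for illustration:

import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;

public class NifiFedStreamingJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("nifi-fed-streaming");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092");   // placeholder
        kafkaParams.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put("group.id", "nifi-spark-demo");          // placeholder
        kafkaParams.put("auto.offset.reset", "latest");

        // Topic that the NiFi PublishKafka processor writes to (placeholder name)
        Collection<String> topics = Collections.singletonList("nifi-events");

        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

        // Real processing would go here; this just counts records per micro-batch.
        stream.map(ConsumerRecord::value)
              .count()
              .print();

        jssc.start();
        jssc.awaitTermination();
    }
}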

I can launch a Spark Streaming application from a custom NiFi processor
using the Spark launcher API, but the biggest challenge is that it will
create a Spark streaming context for each flow file, which can be a costly
operation.
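
To illustrate the concern, a minimal sketch of that per-flow-file launch with
org.apache.spark.launcher.SparkLauncher (the jar path and class name are just
placeholders) would be something like:

import java.io.IOException;
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class PerFlowFileLaunch {
    // Imagine this being called from a processor's onTrigger(): every flow file
    // spawns a brand new driver and streaming context, which is the expensive part.
    static SparkAppHandle launchForFlowFile() throws IOException {
        return new SparkLauncher()
                .setAppResource("/opt/jobs/streaming-job.jar")   // placeholder path
                .setMainClass("com.example.StreamingJob")        // placeholder class
                .setMaster("yarn")
                .setDeployMode("cluster")
                .setConf(SparkLauncher.DRIVER_MEMORY, "1g")
                .startApplication();
    }
}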

Would anyone suggest storing the Spark streaming context in a controller
service? Or is there a better approach for running a Spark Streaming
application with NiFi?
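
What I have in mind for the controller service is roughly the sketch below.
The interface name is made up; the point is one shared, long-lived context
instead of one per flow file:

import org.apache.nifi.controller.ControllerService;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Hypothetical service: processors would look this up instead of creating
// their own streaming context, so the context is created once and reused.
public interface SparkStreamingContextService extends ControllerService {

    /** Returns the single long-lived streaming context managed by the service. */
    JavaStreamingContext getStreamingContext();
}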

Thanks and Regards,
Shashi

Re: Spark Streaming with Nifi

Posted by Andrew Psaltis <ps...@gmail.com>.
Ahh --- sorry if I had confused matters earlier. Feel free to reach out if
you get to a sticking point.

Thanks,
Andrew

On Thu, Jun 8, 2017 at 3:32 PM, Shashi Vishwakarma <shashi.vish123@gmail.com
> wrote:

> Thanks Andrew.
>
> Things are pretty clear now. At a low level I need to write a piece of Java
> code which will create a JSON structure similar to a NiFi provenance event
> and send it to another store.
>
> I was under the impression that NiFi had the flexibility of updating the
> provenance store using API calls.
>
> Thanks
> Shashi
>
>
> On Thu, Jun 8, 2017 at 12:37 PM, Andrew Psaltis <ps...@gmail.com>
> wrote:
>
>> Hi Shashi,
>> At this time you cannot write a provenance event into the NiFi provenance
>> repository, which is stored locally on the node that is processing the data.
>> The repository is internal to NiFi; that is why I was suggesting creating a
>> "Spark Provenance Event" that you write to the same external store, so that
>> you can have all the data in one place. However, the data coming from Spark
>> will certainly be different. More information on the provenance repository
>> usage can be found here [1] and the design here [2].
>>
>> Hope that helps.
>>
>> [1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#data_provenance
>> [2] https://cwiki.apache.org/confluence/display/NIFI/Persistent+Provenance+Repository+Design
>>
>> Thanks,
>> Andrew
>>
>> On Thu, Jun 8, 2017 at 6:50 AM, Shashi Vishwakarma <
>> shashi.vish123@gmail.com> wrote:
>>
>>> Hi Andrew,
>>>
>>> Regarding creating the Spark provenance event,
>>>
>>> "let's call it a Spark Provenance Event -- in this you can populate as
>>> much data as you have and write that to a similar data store."
>>>
>>> Is there any way I can write my Spark provenance event to the NiFi
>>> provenance store with some EventId?
>>>
>>> I have a ReportingTask which sends events to another application, but it
>>> relies on the NiFi provenance store. I am thinking that the Spark job will
>>> emit a provenance event which will be written into the NiFi provenance
>>> store, and the reporting task will send that to another application.
>>>
>>> Apologies if my use case is still unclear.
>>>
>>> Thanks
>>> Shashi
>>>
>>>
>>>
>>> On Thu, Jun 8, 2017 at 3:23 AM, Andrew Psaltis <psaltis.andrew@gmail.com
>>> > wrote:
>>>
>>>> Hi Shashi,
>>>> Regarding your upgrade question, I may have confused things. When
>>>> emitting a "provenance" event from your Spark Streaming job, this will not
>>>> be the same exact event as that emitted from NiFi. I was referencing the
>>>> code in the previous email to give insight into the details NiFi does
>>>> provide. In your Spark application you will not have all of the information
>>>> to populate a NiFi Provenance event. Therefore, for your Spark code you can
>>>> come up with a new event, let's call it a Spark Provenance Event -- in this
>>>> you can populate as much data as you have and write that to a similar data
>>>> store. For example, you would want a timestamp, the component can be Spark,
>>>> and any other data you need to emit. Basically, you will be combining the
>>>> NiFi provenance data with your custom Spark provenance data to create a
>>>> complete picture.
>>>>
>>>> As far as the lineage goes, again your Spark Streaming code will be
>>>> executing outside of NiFi and you will have to write this into some other
>>>> store, perhaps to Atlas, and then you can have the lineage for both NiFi and
>>>> Spark. This [1] is an example NiFi reporting task that sends lineage data
>>>> to Atlas; you could extend this concept to work with Spark as well.
>>>>
>>>> Hopefully this helps clarify some things, sorry if my previous email
>>>> was not completely clear.
>>>>
>>>> Thanks
>>>> Andrew
>>>>
>>>> On Wed, Jun 7, 2017 at 4:31 PM, Shashi Vishwakarma <
>>>> shashi.vish123@gmail.com> wrote:
>>>>
>>>>> Thanks a lot Andrew. This is something I was looking for.
>>>>>
>>>>> I have two questions at this point, keeping in mind that I have to
>>>>> generate a provenance event.
>>>>>
>>>>> 1. How will I manage upgrades if I generate custom provenance events and
>>>>> the NiFi community makes significant changes to the NiFi provenance structure?
>>>>>
>>>>> 2. How do I get lineage information?
>>>>>
>>>>> Thanks
>>>>> Shashi
>>>>>
>>>>>
>>>>> On Tue, Jun 6, 2017 at 7:38 PM, Andrew Psaltis <
>>>>> psaltis.andrew@gmail.com> wrote:
>>>>>
>>>>>> Hi Shashi,
>>>>>> Your assumption is correct -- you would want to send a "provenance"
>>>>>> event from your Spark job; you can see the structure of the provenance
>>>>>> events in NiFi here [1].
>>>>>>
>>>>>> Regarding the flow, if you are waiting on the Spark Streaming code to
>>>>>> compute some value before you continue you can construct it perhaps this
>>>>>> way:
>>>>>>
>>>>>>
>>>>>> [image: Inline image 1]
>>>>>>
>>>>>> Hopefully that helps to clarify it a little. In essence, if you are
>>>>>> waiting on results from the Spark Streaming computation before continuing,
>>>>>> you would use Kafka for the output results from Spark Streaming and then
>>>>>> consume that in NiFi and carry on with your processing.
>>>>>>
>>>>>> [1] https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-client-dto/src/main/java/org/apache/nifi/web/api/dto/provenance/ProvenanceEventDTO.java
>>>>>>
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>> On Mon, Jun 5, 2017 at 9:46 AM, Shashi Vishwakarma <
>>>>>> shashi.vish123@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> I am trying to understand this in a bit more detail. Essentially I
>>>>>>> will have to write some custom code in my Spark Streaming job to construct
>>>>>>> a provenance event and send it to some store, like HBase or a pub/sub
>>>>>>> system, to be consumed by others.
>>>>>>>
>>>>>>> Is that correct ?
>>>>>>>
>>>>>>> If yes, how do I execute the other processors which are present in the
>>>>>>> pipeline?
>>>>>>>
>>>>>>> Ex:
>>>>>>>
>>>>>>> NiFi --> Kafka --> Spark Streaming --> Processor 1 --> Processor 2
>>>>>>>
>>>>>>> Thanks
>>>>>>> Shashi
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 5, 2017 at 12:36 AM, Andrew Psaltis <
>>>>>>> psaltis.andrew@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Shashi,
>>>>>>>> Thanks for the explanation.  I have a better understanding of what
>>>>>>>> you are trying to accomplish. Although Spark streaming is micro-batch, you
>>>>>>>> would not want to keep launching jobs for each batch.   Think of it as the
>>>>>>>> Spark scheduler having a while loop in which it executes your job then
>>>>>>>> sleeps for X amount of time based on the interval you configure.
>>>>>>>>
>>>>>>>> Perhaps a better way would be to do the following:
>>>>>>>> 1. Use the S2S ProvenanceReportingTask to send provenance
>>>>>>>> information from your NiFi instance to a second instance or cluster.
>>>>>>>> 2. In the second NiFi instance/cluster (the one receiving the
>>>>>>>> provenance data) you write the data into, say, HBase or Solr or system X.
>>>>>>>> 3. In your Spark Streaming job you write a "provenance" event into the same
>>>>>>>> data store -- obviously this will not have all the fields that a
>>>>>>>> true NiFi provenance record does, but you can come close.
>>>>>>>>
>>>>>>>> With this, you would then have all provenance data in an
>>>>>>>> external system that you can query to understand the whole system.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>> P.S. sorry if this is choppy or not well formed, on mobile.
>>>>>>>>
>>>>>>>> On Sun, Jun 4, 2017 at 17:46 Shashi Vishwakarma <
>>>>>>>> shashi.vish123@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Andrew.
>>>>>>>>>
>>>>>>>>> I agree that decoupling the components is a good solution from a long-term
>>>>>>>>> perspective. My current data pipeline in NiFi is designed for batch
>>>>>>>>> processing, which I am trying to convert into a streaming model.
>>>>>>>>>
>>>>>>>>> One of the processors in the data pipeline invokes a Spark job; once the
>>>>>>>>> job finishes, control is returned to the NiFi processor, which in turn
>>>>>>>>> generates a provenance event for the job. This provenance event is
>>>>>>>>> important for us.
>>>>>>>>>
>>>>>>>>> Keeping the batch model architecture in mind, I want to design a Spark
>>>>>>>>> Streaming based model in which a NiFi Spark Streaming processor will
>>>>>>>>> process a micro-batch and the job status will be returned to NiFi with a
>>>>>>>>> provenance event. Then I can capture that provenance data for my reports.
>>>>>>>>>
>>>>>>>>> Essentially I will be using NiFi for capturing the provenance event,
>>>>>>>>> while the actual processing will be done by the Spark Streaming job.
>>>>>>>>>
>>>>>>>>> Does this approach seem logical?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Shashi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Jun 4, 2017 at 3:10 PM, Andrew Psaltis <
>>>>>>>>> psaltis.andrew@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Shashi,
>>>>>>>>>> I'm sure there is a way to make this work. However, my first
>>>>>>>>>> question is why you would want to? By design a Spark Streaming application
>>>>>>>>>> should always be running and consuming data from some source, hence the
>>>>>>>>>> notion of streaming. Tying Spark Streaming to NiFi would ultimately result
>>>>>>>>>> in a more coupled and fragile architecture. Perhaps a different way to
>>>>>>>>>> think about it would be to set things up like this:
>>>>>>>>>>
>>>>>>>>>> NiFi --> Kafka <-- Spark Streaming
>>>>>>>>>>
>>>>>>>>>> With this you can do what you are doing today -- using NiFi to
>>>>>>>>>> ingest, transform, make routing decisions, and feed data into Kafka. In
>>>>>>>>>> essence you would be using NiFi to do all the preparation of the data for
>>>>>>>>>> Spark Streaming. Kafka would serve the purpose of a buffer between NiFi and
>>>>>>>>>> Spark Streaming. Finally, Spark Streaming would ingest data from Kafka and
>>>>>>>>>> do what it is designed for -- stream processing. Having a decoupled
>>>>>>>>>> architecture like this also allows you to manage each tier separately, thus
>>>>>>>>>> you can tune, scale, develop, and deploy all separately.
>>>>>>>>>>
>>>>>>>>>> I know I did not directly answer your question on how to make it
>>>>>>>>>> work. But, hopefully this helps provide an approach that will be a better
>>>>>>>>>> long term solution. There may be something I am missing in your initial
>>>>>>>>>> questions.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Andrew
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Jun 3, 2017 at 10:43 PM, Shashi Vishwakarma <
>>>>>>>>>> shashi.vish123@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>> I am looking for a way to make use of Spark Streaming with
>>>>>>>>>>> NiFi. I have seen a couple of posts where a SiteToSite TCP connection is
>>>>>>>>>>> used for the Spark Streaming application, but I think it would be good if
>>>>>>>>>>> I could launch Spark Streaming from a custom NiFi processor.
>>>>>>>>>>>
>>>>>>>>>>> PublishKafka would publish messages into Kafka, and then a NiFi
>>>>>>>>>>> Spark Streaming processor would read from the Kafka topic.
>>>>>>>>>>>
>>>>>>>>>>> I can launch a Spark Streaming application from a custom NiFi
>>>>>>>>>>> processor using the Spark launcher API, but the biggest challenge is that
>>>>>>>>>>> it will create a Spark streaming context for each flow file, which can be
>>>>>>>>>>> a costly operation.
>>>>>>>>>>>
>>>>>>>>>>> Would anyone suggest storing the Spark streaming context in a
>>>>>>>>>>> controller service? Or is there a better approach for running a Spark
>>>>>>>>>>> Streaming application with NiFi?
>>>>>>>>>>>
>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>> Shashi


-- 
Thanks,
Andrew

Subscribe to my book: Streaming Data <http://manning.com/psaltis>
<https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
twitter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>

Re: Spark Streaming with Nifi

Posted by Shashi Vishwakarma <sh...@gmail.com>.
Thanks Andrew.

Things are pretty clear now. At a low level I need to write a piece of Java
code which will create a JSON structure similar to a NiFi provenance event
and send it to another store.
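
As a rough sketch of that piece of code (the field names below are my own
guesses, loosely modelled on NiFi's ProvenanceEventDTO, and the Jackson
library is assumed):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class SparkProvenanceEventBuilder {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Builds a JSON "Spark provenance event" for one processed record or batch. */
    public static String build(String eventId, String flowFileUuid, long recordCount)
            throws Exception {
        ObjectNode event = MAPPER.createObjectNode();
        event.put("eventId", eventId);                     // our own id, not a NiFi id
        event.put("eventTime", System.currentTimeMillis());
        event.put("eventType", "SPARK_PROCESSED");         // made-up type name
        event.put("componentType", "Spark");
        event.put("componentId", "spark-streaming-job");   // placeholder
        event.put("flowFileUuid", flowFileUuid);           // join key back to NiFi events
        event.put("recordCount", recordCount);
        return MAPPER.writeValueAsString(event);
    }
}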

I was under the impression that NiFi had the flexibility of updating the
provenance store using API calls.

Thanks
Shashi



Re: Spark Streaming with Nifi

Posted by Andrew Psaltis <ps...@gmail.com>.
Hi Shashi,
At this time you cannot write a provenance event into the NiFi provenance
repository, which is stored locally on the node that is processing the data.
The repository is internal to NiFi; that is why I was suggesting creating a
"Spark Provenance Event" that you write to the same external store, so that
you can have all the data in one place. However, the data coming from Spark
will certainly be different. More information on the provenance repository
usage can be found here [1] and the design here [2].

Hope that helps.

[1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#data_provenance
[2] https://cwiki.apache.org/confluence/display/NIFI/Persistent+Provenance+Repository+Design
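
As a sketch of the "same external store" idea, writing the event JSON into
HBase might look something like the following. The table name, column family,
and row-key scheme are arbitrary choices for illustration, not anything
NiFi-specific:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ProvenanceEventStore {

    /** Stores one provenance event (NiFi-reported or Spark-generated) as JSON. */
    public static void store(String eventId, String eventJson) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("provenance_events"))) {
            Put put = new Put(Bytes.toBytes(eventId));
            put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("json"), Bytes.toBytes(eventJson));
            table.put(put);
        }
    }
}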

Thanks,
Andrew


Re: Spark Streaming with Nifi

Posted by Shashi Vishwakarma <sh...@gmail.com>.
Hi Andrew,

Regarding creating the Spark provenance event,

"let's call it a Spark Provenance Event -- in this you can populate as
much data as you have and write that to a similar data store."

Is there any way I can write my Spark provenance event to the NiFi provenance
store with some EventId?

I have a ReportingTask which sends events to another application, but it
relies on the NiFi provenance store. I am thinking that the Spark job will
emit a provenance event which will be written into the NiFi provenance store,
and the reporting task will send that to another application.
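
For context, that reporting task pulls events out of NiFi's provenance
repository, roughly like the simplified, hypothetical sketch below (the real
task also has to persist the last event id between runs):

import java.io.IOException;
import java.util.List;
import org.apache.nifi.provenance.ProvenanceEventRecord;
import org.apache.nifi.reporting.AbstractReportingTask;
import org.apache.nifi.reporting.ReportingContext;

public class ForwardProvenanceReportingTask extends AbstractReportingTask {

    private volatile long lastEventId = -1L;

    @Override
    public void onTrigger(final ReportingContext context) {
        try {
            // Read the next batch of events from NiFi's own provenance repository.
            final List<ProvenanceEventRecord> events =
                    context.getEventAccess().getProvenanceEvents(lastEventId + 1, 1000);
            for (final ProvenanceEventRecord event : events) {
                // sendToOtherApplication(event) would push the event downstream
                lastEventId = event.getEventId();
            }
        } catch (final IOException e) {
            getLogger().error("Failed to read provenance events", e);
        }
    }
}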

Apologies if my use case is still unclear.

Thanks
Shashi




Re: Spark Streaming with Nifi

Posted by Andrew Psaltis <ps...@gmail.com>.
Hi Shashi,
Regarding your upgrade question, I may have confused things. When emitting
a "provenance" event from your Spark Streaming job, this will not be the
same exact event as that emitted from NiFi. I was referencing the code in
the previous email to give insight into the details NiFi does provide. In
your Spark application you will not have all of the information to populate
a NiFi Provenance event. Therefore, for your Spark code you can come up
with a new event, let's call it a Spark Provenance Event -- in this you can
populate as much data as you have and write that to a similar data store.
For example, you would want a timestamp, the component can be Spark, and any
other data you need to emit. Basically, you will be combining the NiFi
provenance data with your custom Spark provenance data to create a
complete picture.
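
To make that concrete, a minimal sketch of such an event class is below. The
field set is only a suggestion; the flowFileUuid is the value that would let
you line Spark events up with the NiFi provenance events for the same data:

// Hypothetical event emitted by the Spark job; not a NiFi class.
public class SparkProvenanceEvent {
    public long   eventTime;       // when the Spark job processed the data
    public String componentType;   // e.g. "Spark"
    public String componentId;     // e.g. the streaming application name
    public String eventType;       // e.g. "SPARK_PROCESSED"
    public String flowFileUuid;    // carried through Kafka from NiFi, used to join
    public long   recordCount;     // whatever metrics the job can report
}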

As far as the lineage goes, again your Spark Streaming code will be
executing outside of NiFi and you will have to write this into some other
store, perhaps to Atlas, and then you can have the lineage for both NiFi and
Spark. This [1] is an example NiFi reporting task that sends lineage data
to Atlas; you could extend this concept to work with Spark as well.

Hopefully this helps clarify some things, sorry if my previous email was
not completely clear.

Thanks
Andrew

On Wed, Jun 7, 2017 at 4:31 PM, Shashi Vishwakarma <shashi.vish123@gmail.com
> wrote:

> Thanks a lot Andrew. This is something I was looking for.
>
> I have two question at point keeping in mind I have generate provenance
> event.
>
> 1. How will I manage upgrade ? If I generate custom provenance and Nifi
> community made significant changes in Nifi provenance structure ?
>
> 2. How do I get lineage information ?
>
> Thanks
> Shashi
>
>
> On Tue, Jun 6, 2017 at 7:38 PM, Andrew Psaltis <ps...@gmail.com>
> wrote:
>
>> Hi Shashi,
>> Your assumption is correct -- you would want to send a "provenance" event
>> from your Spark job, you can see the structure of the provenance events in
>> NiFi here [1]
>>
>> Regarding the flow, if you are waiting on the Spark Streaming code to
>> compute some value before you continue you can construct it perhaps this
>> way:
>>
>>
>> [image: Inline image 1]
>>
>> Hopefully that helps to clarify it a little. In essence if you are
>> waiting on results form the Spark Streaming computation before continuing
>> you would use Kafka for the output results from Spark Streaming and then
>> consume that in NiFi and carry on with your processing.
>>
>> [1] https://github.com/apache/nifi/blob/master/nifi-nar-bund
>> les/nifi-framework-bundle/nifi-framework/nifi-client-dto/
>> src/main/java/org/apache/nifi/web/api/dto/provenance/Provena
>> nceEventDTO.java
>>
>> Thanks,
>> Andrew
>>
>> On Mon, Jun 5, 2017 at 9:46 AM, Shashi Vishwakarma <
>> shashi.vish123@gmail.com> wrote:
>>
>>> Hi Andrew,
>>>
>>> I am trying to understand here bit more in detail. Essentially I will
>>> have to write some custom code in my spark streaming job and construct
>>> provenance event and send it to some store like Hbase,PubSub system to be
>>> consumed by others.
>>>
>>> Is that correct ?
>>>
>>> If yes how do I execute other processor which are present in pipeline ?
>>>
>>> Ex
>>>
>>> Nifi --> Kakfa -- > Spark Streaming --> Processor 1 --> Processor 2
>>>
>>> Thanks
>>> Shashi
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Jun 5, 2017 at 12:36 AM, Andrew Psaltis <
>>> psaltis.andrew@gmail.com> wrote:
>>>
>>>> Hi Shashi,
>>>> Thanks for the explanation.  I have a better understanding of what you
>>>> are trying to accomplish. Although Spark streaming is micro-batch, you
>>>> would not want to keep launching jobs for each batch.   Think of it as the
>>>> Spark scheduler having a while loop in which it executes your job then
>>>> sleeps for X amount of time based on the interval you configure.
>>>>
>>>> Perhaps a better way would be to do the following:
>>>> 1. Use the S2S ProvenanceReportingTask to send provenance information
>>>> from your NiFi instance to a second instance or cluster.
>>>> 2. In the second NiFi instance/cluster ( the one receiving the
>>>> provenance data) you write the data into say HBase or Solr or system X.
>>>> 3. In your Spark streaming job you right into the same data store a
>>>> "provenance" event -- obviously this will not have all the fields that a
>>>> true NiFi provenance record does, but you can come close.
>>>>
>>>> With this then once you would then have all provenance data in an
>>>> external system that you can query to understand the whole system.
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>>> P.S. sorry if this is choppy or not well formed, on mobile.
>>>>
>>>> On Sun, Jun 4, 2017 at 17:46 Shashi Vishwakarma <
>>>> shashi.vish123@gmail.com> wrote:
>>>>
>>>>> Thanks Andrew.
>>>>>
>>>>> I agree that decoupling component is good solution from long term
>>>>> perspective. My current data pipeline in Nifi is designed for batch
>>>>> processing which I am trying to convert into streaming model.
>>>>>
>>>>> One of the processor in data pipeline invokes Spark job , once job
>>>>> finished control  is returned to Nifi processor in turn which generates
>>>>> provenance event for job. This provenance event is important for us.
>>>>>
>>>>> Keeping batch model architecture in mind, I want to designed spark
>>>>> streaming based model in which Nifi Spark streaming processor will process
>>>>> micro batch and job status will be returned to Nifi with provenance event.
>>>>> Then I can capture that provenance data for my reports.
>>>>>
>>>>> Essentially I will be using Nifi for capturing provenance event where
>>>>> actual processing will be done by Spark streaming job.
>>>>>
>>>>> Do you see this approach logical ?
>>>>>
>>>>> Thanks
>>>>> Shashi
>>>>>
>>>>>
>>>>> On Sun, Jun 4, 2017 at 3:10 PM, Andrew Psaltis <
>>>>> psaltis.andrew@gmail.com> wrote:
>>>>>
>>>>>> Hi Shashi,
>>>>>> I'm sure there is a way to make this work. However, my first question
>>>>>> is why you would want to? By design a Spark Streaming application should
>>>>>> always be running and consuming data from some source, hence the notion of
>>>>>> streaming. Tying Spark Streaming to NiFi would ultimately result in a more
>>>>>> coupled and fragile architecture. Perhaps a different way to think about it
>>>>>> would be to set things up like this:
>>>>>>
>>>>>> NiFi --> Kafka <-- Spark Streaming
>>>>>>
>>>>>> With this you can do what you are doing today -- using NiFi to
>>>>>> ingest, transform, make routing decisions, and feed data into Kafka. In
>>>>>> essence you would be using NiFi to do all the preparation of the data for
>>>>>> Spark Streaming. Kafka would serve the purpose of a buffer between NiFi and
>>>>>> Spark Streaming. Finally, Spark Streaming would ingest data from Kafka and
>>>>>> do what it is designed for -- stream processing. Having a decoupled
>>>>>> architecture like this also allows you to manage each tier separately, thus
>>>>>> you can tune, scale, develop, and deploy all separately.
>>>>>>
>>>>>> I know I did not directly answer your question on how to make it
>>>>>> work. But, hopefully this helps provide an approach that will be a better
>>>>>> long term solution. There may be something I am missing in your initial
>>>>>> questions.
>>>>>>
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Jun 3, 2017 at 10:43 PM, Shashi Vishwakarma <
>>>>>> shashi.vish123@gmail.com> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I am looking for way where I can make use of spark streaming in
>>>>>>> Nifi. I see couple of post where SiteToSite tcp connection is used for
>>>>>>> spark streaming application but I thinking it will be good If I can launch
>>>>>>> Spark streaming from Nifi custom processor.
>>>>>>>
>>>>>>> PublishKafka will publish message into Kafka followed by Nifi Spark
>>>>>>> streaming processor will read from Kafka Topic.
>>>>>>>
>>>>>>> I can launch Spark streaming application from custom Nifi processor
>>>>>>> using Spark Streaming launcher API but biggest challenge is that it will
>>>>>>> create spark streaming context for each flow file which can be costly
>>>>>>> operation.
>>>>>>>
>>>>>>> Does any one suggest storing spark streaming context  in controller
>>>>>>> service ? or any better approach for running spark streaming application
>>>>>>> with Nifi ?
>>>>>>>
>>>>>>> Thanks and Regards,
>>>>>>> Shashi
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
>>>>>> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
>>>>>> twiiter: @itmdata
>>>>>> <http://twitter.com/intent/user?screen_name=itmdata>
>>>>>>
>>>>>
>>>>> --
>>>> Thanks,
>>>> Andrew
>>>>
>>>> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
>>>> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
>>>> twiiter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks,
>> Andrew
>>
>> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
>> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
>> twiiter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>>
>
>


-- 
Thanks,
Andrew

Subscribe to my book: Streaming Data <http://manning.com/psaltis>
<https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
twitter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>

Re: Spark Streaming with Nifi

Posted by Shashi Vishwakarma <sh...@gmail.com>.
Thanks a lot Andrew. This is something I was looking for.

I have two questions at this point, keeping in mind that I have to generate
provenance events.

1. How will I manage upgrades? What if I generate custom provenance events
and the NiFi community makes significant changes to the NiFi provenance
structure?

2. How do I get lineage information?

Thanks
Shashi


On Tue, Jun 6, 2017 at 7:38 PM, Andrew Psaltis <ps...@gmail.com>
wrote:

> Hi Shashi,
> Your assumption is correct -- you would want to send a "provenance" event
> from your Spark job, you can see the structure of the provenance events in
> NiFi here [1]
>
> Regarding the flow, if you are waiting on the Spark Streaming code to
> compute some value before you continue you can construct it perhaps this
> way:
>
>
> [image: Inline image 1]
>
> Hopefully that helps to clarify it a little. In essence if you are waiting
> on results form the Spark Streaming computation before continuing you would
> use Kafka for the output results from Spark Streaming and then consume that
> in NiFi and carry on with your processing.
>
> [1] https://github.com/apache/nifi/blob/master/nifi-nar-
> bundles/nifi-framework-bundle/nifi-framework/nifi-client-
> dto/src/main/java/org/apache/nifi/web/api/dto/provenance/
> ProvenanceEventDTO.java
>
> Thanks,
> Andrew
>
> On Mon, Jun 5, 2017 at 9:46 AM, Shashi Vishwakarma <
> shashi.vish123@gmail.com> wrote:
>
>> Hi Andrew,
>>
>> I am trying to understand here bit more in detail. Essentially I will
>> have to write some custom code in my spark streaming job and construct
>> provenance event and send it to some store like Hbase,PubSub system to be
>> consumed by others.
>>
>> Is that correct ?
>>
>> If yes how do I execute other processor which are present in pipeline ?
>>
>> Ex
>>
>> Nifi --> Kafka --> Spark Streaming --> Processor 1 --> Processor 2
>>
>> Thanks
>> Shashi
>>
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Jun 5, 2017 at 12:36 AM, Andrew Psaltis <psaltis.andrew@gmail.com
>> > wrote:
>>
>>> Hi Shashi,
>>> Thanks for the explanation.  I have a better understanding of what you
>>> are trying to accomplish. Although Spark streaming is micro-batch, you
>>> would not want to keep launching jobs for each batch.   Think of it as the
>>> Spark scheduler having a while loop in which it executes your job then
>>> sleeps for X amount of time based on the interval you configure.
>>>
>>> Perhaps a better way would be to do the following:
>>> 1. Use the S2S ProvenanceReportingTask to send provenance information
>>> from your NiFi instance to a second instance or cluster.
>>> 2. In the second NiFi instance/cluster ( the one receiving the
>>> provenance data) you write the data into say HBase or Solr or system X.
>>> 3. In your Spark streaming job you write into the same data store a
>>> "provenance" event -- obviously this will not have all the fields that a
>>> true NiFi provenance record does, but you can come close.
>>>
>>> With this you would then have all provenance data in an
>>> external system that you can query to understand the whole system.
>>>
>>> Thanks,
>>> Andrew
>>>
>>> P.S. sorry if this is choppy or not well formed, on mobile.
>>>
>>> On Sun, Jun 4, 2017 at 17:46 Shashi Vishwakarma <
>>> shashi.vish123@gmail.com> wrote:
>>>
>>>> Thanks Andrew.
>>>>
>>>> I agree that decoupling component is good solution from long term
>>>> perspective. My current data pipeline in Nifi is designed for batch
>>>> processing which I am trying to convert into streaming model.
>>>>
>>>> One of the processor in data pipeline invokes Spark job , once job
>>>> finished control  is returned to Nifi processor in turn which generates
>>>> provenance event for job. This provenance event is important for us.
>>>>
>>>> Keeping batch model architecture in mind, I want to designed spark
>>>> streaming based model in which Nifi Spark streaming processor will process
>>>> micro batch and job status will be returned to Nifi with provenance event.
>>>> Then I can capture that provenance data for my reports.
>>>>
>>>> Essentially I will be using Nifi for capturing provenance event where
>>>> actual processing will be done by Spark streaming job.
>>>>
>>>> Do you see this approach logical ?
>>>>
>>>> Thanks
>>>> Shashi
>>>>
>>>>
>>>> On Sun, Jun 4, 2017 at 3:10 PM, Andrew Psaltis <
>>>> psaltis.andrew@gmail.com> wrote:
>>>>
>>>>> Hi Shashi,
>>>>> I'm sure there is a way to make this work. However, my first question
>>>>> is why you would want to? By design a Spark Streaming application should
>>>>> always be running and consuming data from some source, hence the notion of
>>>>> streaming. Tying Spark Streaming to NiFi would ultimately result in a more
>>>>> coupled and fragile architecture. Perhaps a different way to think about it
>>>>> would be to set things up like this:
>>>>>
>>>>> NiFi --> Kafka <-- Spark Streaming
>>>>>
>>>>> With this you can do what you are doing today -- using NiFi to ingest,
>>>>> transform, make routing decisions, and feed data into Kafka. In essence you
>>>>> would be using NiFi to do all the preparation of the data for Spark
>>>>> Streaming. Kafka would serve the purpose of a buffer between NiFi and Spark
>>>>> Streaming. Finally, Spark Streaming would ingest data from Kafka and do
>>>>> what it is designed for -- stream processing. Having a decoupled
>>>>> architecture like this also allows you to manage each tier separately, thus
>>>>> you can tune, scale, develop, and deploy all separately.
>>>>>
>>>>> I know I did not directly answer your question on how to make it work.
>>>>> But, hopefully this helps provide an approach that will be a better long
>>>>> term solution. There may be something I am missing in your initial
>>>>> questions.
>>>>>
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Jun 3, 2017 at 10:43 PM, Shashi Vishwakarma <
>>>>> shashi.vish123@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I am looking for way where I can make use of spark streaming in Nifi.
>>>>>> I see couple of post where SiteToSite tcp connection is used for spark
>>>>>> streaming application but I thinking it will be good If I can launch Spark
>>>>>> streaming from Nifi custom processor.
>>>>>>
>>>>>> PublishKafka will publish message into Kafka followed by Nifi Spark
>>>>>> streaming processor will read from Kafka Topic.
>>>>>>
>>>>>> I can launch Spark streaming application from custom Nifi processor
>>>>>> using Spark Streaming launcher API but biggest challenge is that it will
>>>>>> create spark streaming context for each flow file which can be costly
>>>>>> operation.
>>>>>>
>>>>>> Does any one suggest storing spark streaming context  in controller
>>>>>> service ? or any better approach for running spark streaming application
>>>>>> with Nifi ?
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Shashi
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
>>>>> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
>>>>> twiiter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>>>>>
>>>>
>>>> --
>>> Thanks,
>>> Andrew
>>>
>>> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
>>> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
>>> twiiter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>>>
>>
>>
>
>
> --
> Thanks,
> Andrew
>
> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
> twiiter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>

Re: Spark Streaming with Nifi

Posted by Andrew Psaltis <ps...@gmail.com>.
Hi Shashi,
Your assumption is correct -- you would want to send a "provenance" event
from your Spark job; you can see the structure of NiFi's provenance events
here [1].

Regarding the flow, if you are waiting on the Spark Streaming code to
compute some value before you continue, you could perhaps construct it this
way:


[image: Inline image 1]

Hopefully that helps to clarify it a little. In essence, if you are waiting
on results from the Spark Streaming computation before continuing, you
would use Kafka for the output results from Spark Streaming and then
consume that in NiFi and carry on with your processing.

[1]
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-client-dto/src/main/java/org/apache/nifi/web/api/dto/provenance/ProvenanceEventDTO.java
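As a rough, untested sketch of the Spark side (the topic names and JSON
fields below are placeholders only -- they loosely mirror, but are not, the
actual ProvenanceEventDTO shape from [1]): the job publishes both its
result and a provenance-style event back to Kafka, and a ConsumeKafka
processor in NiFi picks them up and carries on with the rest of the flow.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SparkSideEmitter {

    private final KafkaProducer<String, String> producer;

    public SparkSideEmitter(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // The computed result, for NiFi (ConsumeKafka) to continue the pipeline with.
    public void emitResult(String flowId, String resultJson) {
        producer.send(new ProducerRecord<>("spark-results", flowId, resultJson));
    }

    // A provenance-style event, loosely modelled on the fields in [1];
    // the field names here are illustrative, not the real DTO.
    public void emitProvenance(String flowId, String eventType, String details) {
        String event = String.format(
                "{\"eventType\":\"%s\",\"componentType\":\"SparkStreaming\","
                + "\"flowFileUuid\":\"%s\",\"eventTime\":%d,\"details\":\"%s\"}",
                eventType, flowId, System.currentTimeMillis(), details);
        producer.send(new ProducerRecord<>("spark-provenance", flowId, event));
    }
}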

Thanks,
Andrew

On Mon, Jun 5, 2017 at 9:46 AM, Shashi Vishwakarma <shashi.vish123@gmail.com
> wrote:

> Hi Andrew,
>
> I am trying to understand here bit more in detail. Essentially I will have
> to write some custom code in my spark streaming job and construct
> provenance event and send it to some store like Hbase,PubSub system to be
> consumed by others.
>
> Is that correct ?
>
> If yes how do I execute other processor which are present in pipeline ?
>
> Ex
>
> Nifi --> Kafka --> Spark Streaming --> Processor 1 --> Processor 2
>
> Thanks
> Shashi
>
>
>
>
>
>
>
>
> On Mon, Jun 5, 2017 at 12:36 AM, Andrew Psaltis <ps...@gmail.com>
> wrote:
>
>> Hi Shashi,
>> Thanks for the explanation.  I have a better understanding of what you
>> are trying to accomplish. Although Spark streaming is micro-batch, you
>> would not want to keep launching jobs for each batch.   Think of it as the
>> Spark scheduler having a while loop in which it executes your job then
>> sleeps for X amount of time based on the interval you configure.
>>
>> Perhaps a better way would be to do the following:
>> 1. Use the S2S ProvenanceReportingTask to send provenance information
>> from your NiFi instance to a second instance or cluster.
>> 2. In the second NiFi instance/cluster ( the one receiving the provenance
>> data) you write the data into say HBase or Solr or system X.
>> 3. In your Spark streaming job you write into the same data store a
>> "provenance" event -- obviously this will not have all the fields that a
>> true NiFi provenance record does, but you can come close.
>>
>> With this you would then have all provenance data in an
>> external system that you can query to understand the whole system.
>>
>> Thanks,
>> Andrew
>>
>> P.S. sorry if this is choppy or not well formed, on mobile.
>>
>> On Sun, Jun 4, 2017 at 17:46 Shashi Vishwakarma <sh...@gmail.com>
>> wrote:
>>
>>> Thanks Andrew.
>>>
>>> I agree that decoupling component is good solution from long term
>>> perspective. My current data pipeline in Nifi is designed for batch
>>> processing which I am trying to convert into streaming model.
>>>
>>> One of the processor in data pipeline invokes Spark job , once job
>>> finished control  is returned to Nifi processor in turn which generates
>>> provenance event for job. This provenance event is important for us.
>>>
>>> Keeping batch model architecture in mind, I want to designed spark
>>> streaming based model in which Nifi Spark streaming processor will process
>>> micro batch and job status will be returned to Nifi with provenance event.
>>> Then I can capture that provenance data for my reports.
>>>
>>> Essentially I will be using Nifi for capturing provenance event where
>>> actual processing will be done by Spark streaming job.
>>>
>>> Do you see this approach logical ?
>>>
>>> Thanks
>>> Shashi
>>>
>>>
>>> On Sun, Jun 4, 2017 at 3:10 PM, Andrew Psaltis <psaltis.andrew@gmail.com
>>> > wrote:
>>>
>>>> Hi Shashi,
>>>> I'm sure there is a way to make this work. However, my first question
>>>> is why you would want to? By design a Spark Streaming application should
>>>> always be running and consuming data from some source, hence the notion of
>>>> streaming. Tying Spark Streaming to NiFi would ultimately result in a more
>>>> coupled and fragile architecture. Perhaps a different way to think about it
>>>> would be to set things up like this:
>>>>
>>>> NiFi --> Kafka <-- Spark Streaming
>>>>
>>>> With this you can do what you are doing today -- using NiFi to ingest,
>>>> transform, make routing decisions, and feed data into Kafka. In essence you
>>>> would be using NiFi to do all the preparation of the data for Spark
>>>> Streaming. Kafka would serve the purpose of a buffer between NiFi and Spark
>>>> Streaming. Finally, Spark Streaming would ingest data from Kafka and do
>>>> what it is designed for -- stream processing. Having a decoupled
>>>> architecture like this also allows you to manage each tier separately, thus
>>>> you can tune, scale, develop, and deploy all separately.
>>>>
>>>> I know I did not directly answer your question on how to make it work.
>>>> But, hopefully this helps provide an approach that will be a better long
>>>> term solution. There may be something I am missing in your initial
>>>> questions.
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>>>
>>>>
>>>> On Sat, Jun 3, 2017 at 10:43 PM, Shashi Vishwakarma <
>>>> shashi.vish123@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I am looking for way where I can make use of spark streaming in Nifi.
>>>>> I see couple of post where SiteToSite tcp connection is used for spark
>>>>> streaming application but I thinking it will be good If I can launch Spark
>>>>> streaming from Nifi custom processor.
>>>>>
>>>>> PublishKafka will publish message into Kafka followed by Nifi Spark
>>>>> streaming processor will read from Kafka Topic.
>>>>>
>>>>> I can launch Spark streaming application from custom Nifi processor
>>>>> using Spark Streaming launcher API but biggest challenge is that it will
>>>>> create spark streaming context for each flow file which can be costly
>>>>> operation.
>>>>>
>>>>> Does any one suggest storing spark streaming context  in controller
>>>>> service ? or any better approach for running spark streaming application
>>>>> with Nifi ?
>>>>>
>>>>> Thanks and Regards,
>>>>> Shashi
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Andrew
>>>>
>>>> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
>>>> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
>>>> twiiter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>>>>
>>>
>>> --
>> Thanks,
>> Andrew
>>
>> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
>> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
>> twiiter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>>
>
>


-- 
Thanks,
Andrew

Subscribe to my book: Streaming Data <http://manning.com/psaltis>
<https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
twitter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>

Re: Spark Streaming with Nifi

Posted by Shashi Vishwakarma <sh...@gmail.com>.
Hi Andrew,

I am trying to understand this a bit more in detail. Essentially I will
have to write some custom code in my Spark Streaming job to construct a
provenance event and send it to some store like HBase or a pub/sub system,
to be consumed by others.

Is that correct?

If yes, how do I execute the other processors which are present in the
pipeline?

Ex:

Nifi --> Kafka --> Spark Streaming --> Processor 1 --> Processor 2

Thanks
Shashi








On Mon, Jun 5, 2017 at 12:36 AM, Andrew Psaltis <ps...@gmail.com>
wrote:

> Hi Shashi,
> Thanks for the explanation.  I have a better understanding of what you are
> trying to accomplish. Although Spark streaming is micro-batch, you would
> not want to keep launching jobs for each batch.   Think of it as the Spark
> scheduler having a while loop in which it executes your job then sleeps for
> X amount of time based on the interval you configure.
>
> Perhaps a better way would be to do the following:
> 1. Use the S2S ProvenanceReportingTask to send provenance information from
> your NiFi instance to a second instance or cluster.
> 2. In the second NiFi instance/cluster ( the one receiving the provenance
> data) you write the data into say HBase or Solr or system X.
> 3. In your Spark streaming job you write into the same data store a
> "provenance" event -- obviously this will not have all the fields that a
> true NiFi provenance record does, but you can come close.
>
> With this you would then have all provenance data in an
> system that you can query to understand the whole system.
>
> Thanks,
> Andrew
>
> P.S. sorry if this is choppy or not well formed, on mobile.
>
> On Sun, Jun 4, 2017 at 17:46 Shashi Vishwakarma <sh...@gmail.com>
> wrote:
>
>> Thanks Andrew.
>>
>> I agree that decoupling component is good solution from long term
>> perspective. My current data pipeline in Nifi is designed for batch
>> processing which I am trying to convert into streaming model.
>>
>> One of the processor in data pipeline invokes Spark job , once job
>> finished control  is returned to Nifi processor in turn which generates
>> provenance event for job. This provenance event is important for us.
>>
>> Keeping batch model architecture in mind, I want to designed spark
>> streaming based model in which Nifi Spark streaming processor will process
>> micro batch and job status will be returned to Nifi with provenance event.
>> Then I can capture that provenance data for my reports.
>>
>> Essentially I will be using Nifi for capturing provenance event where
>> actual processing will be done by Spark streaming job.
>>
>> Do you see this approach logical ?
>>
>> Thanks
>> Shashi
>>
>>
>> On Sun, Jun 4, 2017 at 3:10 PM, Andrew Psaltis <ps...@gmail.com>
>> wrote:
>>
>>> Hi Shashi,
>>> I'm sure there is a way to make this work. However, my first question is
>>> why you would want to? By design a Spark Streaming application should
>>> always be running and consuming data from some source, hence the notion of
>>> streaming. Tying Spark Streaming to NiFi would ultimately result in a more
>>> coupled and fragile architecture. Perhaps a different way to think about it
>>> would be to set things up like this:
>>>
>>> NiFi --> Kafka <-- Spark Streaming
>>>
>>> With this you can do what you are doing today -- using NiFi to ingest,
>>> transform, make routing decisions, and feed data into Kafka. In essence you
>>> would be using NiFi to do all the preparation of the data for Spark
>>> Streaming. Kafka would serve the purpose of a buffer between NiFi and Spark
>>> Streaming. Finally, Spark Streaming would ingest data from Kafka and do
>>> what it is designed for -- stream processing. Having a decoupled
>>> architecture like this also allows you to manage each tier separately, thus
>>> you can tune, scale, develop, and deploy all separately.
>>>
>>> I know I did not directly answer your question on how to make it work.
>>> But, hopefully this helps provide an approach that will be a better long
>>> term solution. There may be something I am missing in your initial
>>> questions.
>>>
>>> Thanks,
>>> Andrew
>>>
>>>
>>>
>>> On Sat, Jun 3, 2017 at 10:43 PM, Shashi Vishwakarma <
>>> shashi.vish123@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> I am looking for way where I can make use of spark streaming in Nifi. I
>>>> see couple of post where SiteToSite tcp connection is used for spark
>>>> streaming application but I thinking it will be good If I can launch Spark
>>>> streaming from Nifi custom processor.
>>>>
>>>> PublishKafka will publish message into Kafka followed by Nifi Spark
>>>> streaming processor will read from Kafka Topic.
>>>>
>>>> I can launch Spark streaming application from custom Nifi processor
>>>> using Spark Streaming launcher API but biggest challenge is that it will
>>>> create spark streaming context for each flow file which can be costly
>>>> operation.
>>>>
>>>> Does any one suggest storing spark streaming context  in controller
>>>> service ? or any better approach for running spark streaming application
>>>> with Nifi ?
>>>>
>>>> Thanks and Regards,
>>>> Shashi
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Andrew
>>>
>>> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
>>> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
>>> twiiter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>>>
>>
>> --
> Thanks,
> Andrew
>
> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
> twiiter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>

Re: Spark Streaming with Nifi

Posted by Andrew Psaltis <ps...@gmail.com>.
Hi Shashi,
Thanks for the explanation. I have a better understanding of what you are
trying to accomplish. Although Spark Streaming is micro-batch, you would
not want to keep launching a job for each batch. Think of it as the Spark
scheduler having a while loop in which it executes your job and then sleeps
for X amount of time, based on the batch interval you configure.

Perhaps a better way would be to do the following:
1. Use the S2S ProvenanceReportingTask to send provenance information from
your NiFi instance to a second instance or cluster.
2. In the second NiFi instance/cluster (the one receiving the provenance
data) you write the data into, say, HBase or Solr or system X.
3. In your Spark streaming job you write a "provenance" event into the same
data store -- obviously this will not have all the fields that a true NiFi
provenance record does, but you can come close.

With this you would then have all provenance data in an external system
that you can query to understand the whole system.

Thanks,
Andrew

P.S. sorry if this is choppy or not well formed, on mobile.

On Sun, Jun 4, 2017 at 17:46 Shashi Vishwakarma <sh...@gmail.com>
wrote:

> Thanks Andrew.
>
> I agree that decoupling component is good solution from long term
> perspective. My current data pipeline in Nifi is designed for batch
> processing which I am trying to convert into streaming model.
>
> One of the processor in data pipeline invokes Spark job , once job
> finished control  is returned to Nifi processor in turn which generates
> provenance event for job. This provenance event is important for us.
>
> Keeping batch model architecture in mind, I want to designed spark
> streaming based model in which Nifi Spark streaming processor will process
> micro batch and job status will be returned to Nifi with provenance event.
> Then I can capture that provenance data for my reports.
>
> Essentially I will be using Nifi for capturing provenance event where
> actual processing will be done by Spark streaming job.
>
> Do you see this approach logical ?
>
> Thanks
> Shashi
>
>
> On Sun, Jun 4, 2017 at 3:10 PM, Andrew Psaltis <ps...@gmail.com>
> wrote:
>
>> Hi Shashi,
>> I'm sure there is a way to make this work. However, my first question is
>> why you would want to? By design a Spark Streaming application should
>> always be running and consuming data from some source, hence the notion of
>> streaming. Tying Spark Streaming to NiFi would ultimately result in a more
>> coupled and fragile architecture. Perhaps a different way to think about it
>> would be to set things up like this:
>>
>> NiFi --> Kafka <-- Spark Streaming
>>
>> With this you can do what you are doing today -- using NiFi to ingest,
>> transform, make routing decisions, and feed data into Kafka. In essence you
>> would be using NiFi to do all the preparation of the data for Spark
>> Streaming. Kafka would serve the purpose of a buffer between NiFi and Spark
>> Streaming. Finally, Spark Streaming would ingest data from Kafka and do
>> what it is designed for -- stream processing. Having a decoupled
>> architecture like this also allows you to manage each tier separately, thus
>> you can tune, scale, develop, and deploy all separately.
>>
>> I know I did not directly answer your question on how to make it work.
>> But, hopefully this helps provide an approach that will be a better long
>> term solution. There may be something I am missing in your initial
>> questions.
>>
>> Thanks,
>> Andrew
>>
>>
>>
>> On Sat, Jun 3, 2017 at 10:43 PM, Shashi Vishwakarma <
>> shashi.vish123@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I am looking for way where I can make use of spark streaming in Nifi. I
>>> see couple of post where SiteToSite tcp connection is used for spark
>>> streaming application but I thinking it will be good If I can launch Spark
>>> streaming from Nifi custom processor.
>>>
>>> PublishKafka will publish message into Kafka followed by Nifi Spark
>>> streaming processor will read from Kafka Topic.
>>>
>>> I can launch Spark streaming application from custom Nifi processor
>>> using Spark Streaming launcher API but biggest challenge is that it will
>>> create spark streaming context for each flow file which can be costly
>>> operation.
>>>
>>> Does any one suggest storing spark streaming context  in controller
>>> service ? or any better approach for running spark streaming application
>>> with Nifi ?
>>>
>>> Thanks and Regards,
>>> Shashi
>>>
>>>
>>>
>>
>>
>> --
>> Thanks,
>> Andrew
>>
>> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
>> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
>> twiiter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>>
>
> --
Thanks,
Andrew

Subscribe to my book: Streaming Data <http://manning.com/psaltis>
<https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
twitter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>

Re: Spark Streaming with Nifi

Posted by Shashi Vishwakarma <sh...@gmail.com>.
Thanks Andrew.

I agree that decoupling components is a good solution from a long-term
perspective. My current data pipeline in NiFi is designed for batch
processing, which I am trying to convert into a streaming model.

One of the processors in the data pipeline invokes a Spark job; once the
job finishes, control is returned to the NiFi processor, which in turn
generates a provenance event for the job. This provenance event is
important for us.

Keeping the batch-model architecture in mind, I want to design a Spark
Streaming based model in which a NiFi Spark Streaming processor will
process a micro-batch and the job status will be returned to NiFi with a
provenance event. Then I can capture that provenance data for my reports.

Essentially I will be using NiFi for capturing provenance events, while the
actual processing will be done by the Spark Streaming job.

Do you see this approach as logical?

Thanks
Shashi


On Sun, Jun 4, 2017 at 3:10 PM, Andrew Psaltis <ps...@gmail.com>
wrote:

> Hi Shashi,
> I'm sure there is a way to make this work. However, my first question is
> why you would want to? By design a Spark Streaming application should
> always be running and consuming data from some source, hence the notion of
> streaming. Tying Spark Streaming to NiFi would ultimately result in a more
> coupled and fragile architecture. Perhaps a different way to think about it
> would be to set things up like this:
>
> NiFi --> Kafka <-- Spark Streaming
>
> With this you can do what you are doing today -- using NiFi to ingest,
> transform, make routing decisions, and feed data into Kafka. In essence you
> would be using NiFi to do all the preparation of the data for Spark
> Streaming. Kafka would serve the purpose of a buffer between NiFi and Spark
> Streaming. Finally, Spark Streaming would ingest data from Kafka and do
> what it is designed for -- stream processing. Having a decoupled
> architecture like this also allows you to manage each tier separately, thus
> you can tune, scale, develop, and deploy all separately.
>
> I know I did not directly answer your question on how to make it work.
> But, hopefully this helps provide an approach that will be a better long
> term solution. There may be something I am missing in your initial
> questions.
>
> Thanks,
> Andrew
>
>
>
> On Sat, Jun 3, 2017 at 10:43 PM, Shashi Vishwakarma <
> shashi.vish123@gmail.com> wrote:
>
>> Hi
>>
>> I am looking for way where I can make use of spark streaming in Nifi. I
>> see couple of post where SiteToSite tcp connection is used for spark
>> streaming application but I thinking it will be good If I can launch Spark
>> streaming from Nifi custom processor.
>>
>> PublishKafka will publish message into Kafka followed by Nifi Spark
>> streaming processor will read from Kafka Topic.
>>
>> I can launch Spark streaming application from custom Nifi processor using
>> Spark Streaming launcher API but biggest challenge is that it will create
>> spark streaming context for each flow file which can be costly operation.
>>
>> Does any one suggest storing spark streaming context  in controller
>> service ? or any better approach for running spark streaming application
>> with Nifi ?
>>
>> Thanks and Regards,
>> Shashi
>>
>>
>>
>
>
> --
> Thanks,
> Andrew
>
> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
> twiiter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>

Re: Spark Streaming with Nifi

Posted by Andrew Psaltis <ps...@gmail.com>.
Hi Shashi,
I'm sure there is a way to make this work. However, my first question is
why you would want to? By design a Spark Streaming application should
always be running and consuming data from some source, hence the notion of
streaming. Tying Spark Streaming to NiFi would ultimately result in a more
coupled and fragile architecture. Perhaps a different way to think about it
would be to set things up like this:

NiFi --> Kafka <-- Spark Streaming

With this you can do what you are doing today -- using NiFi to ingest,
transform, make routing decisions, and feed data into Kafka. In essence you
would be using NiFi to do all the preparation of the data for Spark
Streaming. Kafka would serve the purpose of a buffer between NiFi and Spark
Streaming. Finally, Spark Streaming would ingest data from Kafka and do
what it is designed for -- stream processing. Having a decoupled
architecture like this also allows you to manage each tier separately, thus
you can tune, scale, develop, and deploy all separately.
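As a very rough sketch of the Spark Streaming side (assuming Spark 2.x with
the spark-streaming-kafka-0-10 integration; the broker address, topic name,
group id, and the per-batch count are placeholders, not a real job):

import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class NifiFedStream {
    public static void main(String[] args) throws InterruptedException {
        // The master URL is supplied by spark-submit.
        SparkConf conf = new SparkConf().setAppName("nifi-fed-stream");
        // One long-running context; Spark schedules a micro-batch every 30 seconds.
        JavaStreamingContext jssc =
                new JavaStreamingContext(conf, Durations.seconds(30));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker:9092");       // placeholder
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "spark-nifi-demo");
        kafkaParams.put("auto.offset.reset", "latest");

        // The topic that NiFi's PublishKafka processor writes to.
        Collection<String> topics = Arrays.asList("from-nifi");

        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

        // Placeholder processing: count the records in each micro-batch.
        stream.map(record -> record.value())
              .count()
              .print();

        jssc.start();
        jssc.awaitTermination();
    }
}

On the NiFi side this pairs with PublishKafka feeding the "from-nifi" topic
(the topic name is just an example).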

I know I did not directly answer your question on how to make it work. But,
hopefully this helps provide an approach that will be a better long term
solution. There may be something I am missing in your initial questions.

Thanks,
Andrew



On Sat, Jun 3, 2017 at 10:43 PM, Shashi Vishwakarma <
shashi.vish123@gmail.com> wrote:

> Hi
>
> I am looking for way where I can make use of spark streaming in Nifi. I
> see couple of post where SiteToSite tcp connection is used for spark
> streaming application but I thinking it will be good If I can launch Spark
> streaming from Nifi custom processor.
>
> PublishKafka will publish message into Kafka followed by Nifi Spark
> streaming processor will read from Kafka Topic.
>
> I can launch Spark streaming application from custom Nifi processor using
> Spark Streaming launcher API but biggest challenge is that it will create
> spark streaming context for each flow file which can be costly operation.
>
> Does any one suggest storing spark streaming context  in controller
> service ? or any better approach for running spark streaming application
> with Nifi ?
>
> Thanks and Regards,
> Shashi
>
>
>


-- 
Thanks,
Andrew

Subscribe to my book: Streaming Data <http://manning.com/psaltis>
<https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
twitter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>