Posted to users@nifi.apache.org by Mike Thomsen <mi...@gmail.com> on 2018/03/01 15:11:24 UTC

Re: Atlas and NiFi integration help

So I tried again, and finally got something populated (screenshot attached
for reference). What I don't see is anything like the provenance data that
the processors store. Like nothing about the flowfiles, their attributes,
etc.

My goal here is to have a long-term, searchable repository of provenance
data so questions like "when was data set XYZ reindexed" can be answered.
Is the flowfile provenance data not being captured and sent to Atlas, or am
I doing it wrong?

If the answer is "not yet" I'm cool with that and would be happy to take a
stab at expanding the scope of the reporting task's capabilities. I just
need someone more knowledgeable on this integration to give me pointers.

Thanks,

Mike

On Wed, Feb 28, 2018 at 2:43 PM, Mike Thomsen <mi...@gmail.com>
wrote:

> Matt,
>
> Yeah, I saw that pretty early on. Admittedly my question may be a bit
> nebulous. What I'm trying to figure out is what I should be seeing in Atlas
> if NiFi is sending it events properly. Since knowledge of this integration
> is probably concentrated here, I'm not sure I can go to the Atlas
> list and ask them the same question.
>
> Thanks,
>
> Mike
>
> On Wed, Feb 28, 2018 at 2:13 PM, Matt Burgess <ma...@apache.org>
> wrote:
>
>> Mike,
>>
>> There is a nifi-atlas-bundle in NiFi with a NAR that includes the
>> ReportLineageToAtlas reporting task, but IIRC it is so large that it
>> is not included in the default assembly. Instead there is an
>> "include-atlas" profile that can be activated when building the
>> assembly, and that should include the Atlas NAR and associated
>> reporting task.
>>
>> Regards,
>> Matt
>>
>>
>> On Wed, Feb 28, 2018 at 1:42 PM, Mike Thomsen <mi...@gmail.com>
>> wrote:
>> > I have Atlas 0.8.2 (BerkeleyDB and embedded ES) and a NiFi 1.6.0 nightly,
>> > both up and claiming that they can talk to one another.
>> >
>> > What should I be seeing if they are? My test configuration consists of a
>> > simple process group that has GetMongo, UpdateAttribute and
>> > PutElasticsearchHttpRecord. I'm not sure if events are actually making it.
>> >
>> > The Atlas documentation is pretty limited on setting up a vanilla
>> > installation, so I was wondering if someone could point me in the right
>> > direction from a NiFi point of view on what I should be seeing so I can
>> > start fumbling around in the right direction.
>> >
>> > Thanks,
>> >
>> > Mike
>>
>
>

Re: Atlas and NiFi integration help

Posted by Mike Thomsen <mi...@gmail.com>.
Bryan,

That did it. It might not be able to answer at the granularity of how many
times a reindex was done, but with the right mix of UpdateAttribute and
that task, I was able to build an Elasticsearch index that can at least
show a date histogram aggregation of when reindex operations happened by
data set.
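
For later readers, a minimal sketch of the kind of aggregation described
above. It only builds the search body and does not contact a cluster. The
field names data_set and timestamp are assumptions about how the indexed
events were shaped, not something stated in this thread. The key that sets
the bucket size also depends on the Elasticsearch version (older releases
use "interval", newer ones "calendar_interval"), so it is a parameter here.

```python
import json


def build_reindex_histogram_query(data_set_field="data_set",
                                  timestamp_field="timestamp",
                                  interval="day",
                                  interval_key="calendar_interval"):
    """Build a search body: one date histogram of events per data set.

    All field names are hypothetical; adjust them to the real index mapping.
    """
    return {
        "size": 0,  # only the aggregation buckets are wanted, not the hits
        "aggs": {
            "by_data_set": {
                # One bucket per distinct data set name.
                "terms": {"field": data_set_field},
                "aggs": {
                    "reindex_over_time": {
                        # Within each data set, count events per time bucket.
                        "date_histogram": {
                            "field": timestamp_field,
                            interval_key: interval,
                        }
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_reindex_histogram_query(), indent=2))
```

The resulting dictionary can be sent as the body of a search request with
whatever client is already in use.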

On Thu, Mar 1, 2018 at 11:23 AM, Bryan Bende <bb...@gmail.com> wrote:

> Mike,
>
> That is basically the point of SiteToSiteProvenanceReportingTask...
> you send the provenance events from the reporting task back to the same
> cluster, and then leverage existing processors like the Elasticsearch
> processors.
>
> Otherwise we'd get into building 100 reporting tasks for all the
> various destinations, just like all the processors.
>
> -Bryan

Re: Atlas and NiFi integration help

Posted by Bryan Bende <bb...@gmail.com>.
Mike,

That is basically the point of SiteToSiteProvenanceReportingTask...
you send the provenance events from the reporting task back to the same
cluster, and then leverage existing processors like the Elasticsearch
processors.

Otherwise we'd get into building 100 reporting tasks for all the
various destinations, just like all the processors.

-Bryan
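
A rough sketch of the downstream step, assuming each batch of provenance
events arrives as a JSON array. The field names used here (timestampMillis,
eventType, componentName, updatedAttributes) and the data.set attribute are
assumptions for illustration; check them against the events your NiFi
version actually emits before relying on them.

```python
import json
from datetime import datetime, timezone


def to_index_docs(batch_json, attribute="data.set"):
    """Turn a JSON array of provenance events into flat documents.

    Only events whose updated attributes contain `attribute` are kept.
    The event field names are assumed, not taken from this thread.
    """
    docs = []
    for event in json.loads(batch_json):
        attrs = event.get("updatedAttributes") or {}
        if attribute not in attrs:
            continue  # skip events that do not carry the tagging attribute
        # Convert epoch milliseconds to an ISO 8601 timestamp in UTC.
        when = datetime.fromtimestamp(event["timestampMillis"] / 1000,
                                      tz=timezone.utc)
        docs.append({
            "data_set": attrs[attribute],
            "event_type": event.get("eventType"),
            "component": event.get("componentName"),
            "timestamp": when.isoformat(),
        })
    return docs
```

Documents shaped like this could then be handed to one of the existing
Elasticsearch processors or any other indexing client.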

On Thu, Mar 1, 2018 at 11:04 AM, Mike Thomsen <mi...@gmail.com> wrote:
> Bryan,
>
> I have a feeling you're right. This might call for a reporting task that
> exports to Elasticsearch so that Kibana dashboards can be used to answer
> these questions.
>
> Thanks,
>
> Mike

Re: Atlas and NiFi integration help

Posted by Mike Thomsen <mi...@gmail.com>.
Bryan,

I have a feeling you're right. This might call for a reporting task that
exports to Elasticsearch so that Kibana dashboards can be used to answer
these questions.

Thanks,

Mike

On Thu, Mar 1, 2018 at 10:20 AM, Bryan Bende <bb...@gmail.com> wrote:

> Mike,
>
> As far as I know, Atlas is not really about "event level" lineage; it
> is more about "system level" or "data set" level.
>
> So I believe the goal of Atlas is to show how the systems are
> connected and how a particular data set flows through the system.
>
> So an example might be... NiFi pulls from source #1, then publishes to
> Kafka topic #1, and then a stream processing system consumes from
> Kafka topic #1, and then writes results to Hive.
>
> Atlas can then tell you that source #1 flowed through all these
> systems and was the source for these results in Hive (something like
> that).
>
> I don't think it's a massive long-term store for event-level provenance
> data like NiFi has, but others can chime in here if I am wrong.
>
> -Bryan

Re: Atlas and NiFi integration help

Posted by Bryan Bende <bb...@gmail.com>.
Mike,

As far as I know, Atlas is not really about "event level" lineage; it
is more about "system level" or "data set" level.

So I believe the goal of Atlas is to show how the systems are
connected and how a particular data set flows through the system.

So an example might be... NiFi pulls from source #1, then publishes to
Kafka topic #1, and then a stream processing system consumes from
Kafka topic #1, and then writes results to Hive.

Atlas can then tell you that source #1 flowed through all these
systems and was the source for these results in Hive (something like
that).

I don't think it's a massive long-term store for event-level provenance
data like NiFi has, but others can chime in here if I am wrong.

-Bryan
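
To make the "data set level" idea concrete, here is a toy sketch of the
example above as a directed graph, with a walk that lists everything
downstream of source #1. The node names and the downstream helper are made
up for illustration; this is not how Atlas stores or queries lineage.

```python
from collections import deque

# Edges point from where data comes from to where it goes next.
# Node names are hypothetical labels for the systems in the example.
LINEAGE = {
    "source #1": ["NiFi flow"],
    "NiFi flow": ["Kafka topic #1"],
    "Kafka topic #1": ["stream processor"],
    "stream processor": ["Hive results"],
}


def downstream(graph, start):
    """Return every node reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


if __name__ == "__main__":
    print(downstream(LINEAGE, "source #1"))
```

Note that nothing in this picture records individual flowfiles or events,
which is the distinction being drawn with NiFi's own provenance data.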


On Thu, Mar 1, 2018 at 10:11 AM, Mike Thomsen <mi...@gmail.com> wrote:
> So I tried again, and finally got something populated (screenshot attached
> for reference). What I don't see is anything like the provenance data that
> the processors store. Like nothing about the flowfiles, their attributes,
> etc.
>
> My goal here is to have a long-term, searchable repository of provenance
> data so questions like "when was data set XYZ reindexed" can be answered. Is
> the flowfile provenance data not being captured and sent to Atlas, or am I
> doing it wrong?
>
> If the answer is "not yet" I'm cool with that and would be happy to take a
> stab at expanding the scope of the reporting task's capabilities. I just
> need someone more knowledgeable on this integration to give me pointers.
>
> Thanks,
>
> Mike