Posted to users@nifi.apache.org by milind parikh <mi...@gmail.com> on 2016/07/07 05:27:57 UTC

Data Provenance @scale in NiFi

I am relatively new to NiFi. I have written a processor in Java for NiFi
(which should give you a sense of my knowledge of NiFi: it is limited).

I have a scenario with about 100k flow files a day representing about 100m
records, which need to be aggregated across 1m data points and 100
dimensions.

If, in my architecture, I split the initial flow file into records, write
them to Kafka at 1,000 records per flow file, and read them back in
parallel, how do I do data provenance on the aggregated values?

The use case I am interested in is showing how one of the data points (out
of 1m) arrived at its daily aggregated value, for an average of 100 records
coming out of very few of the 100k files.

I can't expand the data provenance through the UI (1,000 initial records)
and THEN through 1m data points, nor traverse 1m data points in the UI as
my starting point.

I know the exact reference of the data point (it's a truncated version of
the SHA-1 of a complex but unique data point string).
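
For concreteness, here is roughly how such a reference could be derived.
This is only a sketch; the dimension string and the 16-character truncation
are illustrative placeholders, not my exact scheme:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class DataPointRef {

    // Build the unique data point string from its dimension values,
    // hash it with SHA-1, and keep a truncated hex prefix as the reference.
    static String ref(String dataPointString) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(dataPointString.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.substring(0, 16); // truncated SHA-1 (placeholder length)
    }

    public static void main(String[] args) throws Exception {
        System.out.println(ref("dim1=foo|dim2=bar|dim100=baz"));
    }
}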

Is there a command-line equivalent of the UI that can be targeted more
precisely at one data point?

Thanks
Milind

Re: Data Provenance @scale in NiFi

Posted by milind parikh <mi...@gmail.com>.
Hi Bryan

Thanks. This helps!

Regards
Milind

Re: Data Provenance @scale in NiFi

Posted by Bryan Bende <bb...@gmail.com>.
Milind,

I'm not sure if I understand the question correctly, but are you asking how
to find a specific provenance event beyond the 1,000 most recent that are
displayed when loading the provenance view?

If so, there is a Search button in the top right of the Provenance window
that brings up a search window to search on specific fields or time ranges.
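
And on the command-line angle from your original note: the same search can
be submitted programmatically through the REST API rather than the UI. A
rough sketch in Java follows; the /nifi-api/provenance endpoint is real,
but the host/port, the UUID placeholder, and the exact payload shape are
assumptions to check against the REST API docs for your version:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ProvenanceQuery {
    public static void main(String[] args) throws Exception {
        // Submit an asynchronous provenance query for one FlowFile UUID
        // (any field listed in nifi.provenance.repository.indexed.fields works).
        String body = "{\"provenance\":{\"request\":{"
                + "\"maxResults\":100,"
                + "\"searchTerms\":{\"FlowFileUUID\":\"<flowfile-uuid>\"}}}}";

        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:8080/nifi-api/provenance").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }

        // The response carries a query id and a "finished" flag; a real client
        // would poll GET /nifi-api/provenance/{id} until finished, read the
        // matching events, and then DELETE /nifi-api/provenance/{id}.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);
            }
        }
    }
}

If you carry the truncated SHA-1 as a FlowFile attribute and index it (see
the nifi.properties settings below), the same searchTerms block should be
able to target it directly.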

The fields available to search on can be customized in nifi.properties
through the following:

# Comma-separated list of fields. Fields that are not indexed will not be
# searchable. Valid fields are: EventType, FlowFileUUID, Filename, TransitURI,
# ProcessorID, AlternateIdentifierURI, Relationship, Details
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID

# FlowFile Attributes that should be indexed and made searchable
nifi.provenance.repository.indexed.attributes=twitter.msg, language

In the above example, twitter.msg and language are attributes extracted
from tweets using EvaluateJsonPath.
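
EvaluateJsonPath is configured with one dynamic property per attribute to
extract; a hypothetical setup (with Destination set to flowfile-attribute,
and JsonPath expressions that depend on the tweet JSON) would be:

# Dynamic properties on EvaluateJsonPath: attribute name = JsonPath expression
twitter.msg = $.text
language = $.lang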

Does this help?

-Bryan
