Posted to users@nifi.apache.org by Uwe Geercken <uw...@web.de> on 2017/04/14 15:16:58 UTC

4 Processors for NiFi: SplitToAttribute, MergeTemplate, ExecuteRuleEngine, GenerateData

Hello everybody,
 
I have released (Apache License) my NiFi processors at:
 
https://github.com/uwegeercken/nifi_processors
 
A summary of the processors follows further below. I would like to invite everybody to test them, look at the source code, and send me any feedback that you have.
 
I have done a lot of testing, but I have not been able to test the processors in a cluster setup or with very large amounts of data. Also, I am a native German speaker, so some of the wording in the processors or documentation could probably be improved.
 
Nifi rocks!
 
Uwe
 
==========================
 
Description of processors:
 
1.
The SplitToAttribute processor for Apache NiFi splits the incoming content (CSV) of a flowfile into separate fields using a defined separator.
The values of the individual fields are assigned to flowfile attributes. Each attribute is named using the defined field prefix plus the positional number of the field.
A number format can optionally be specified to format the column number. The number format must follow the pattern syntax of the Java DecimalFormat class.
Example:
A flowfile with the following content:
Peterson, Jenny, New York, USA
When the field prefix is set to "column_" and the field number format is set to "000", the result is 4 attributes:
column_000 = Peterson
column_001 = Jenny
column_002 = New York
column_003 = USA
Note that this processor can be used together with the MergeTemplate processor, which merges the flowfile attributes with Apache Velocity templates.
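The example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the processor's code: the function name, the trimming of whitespace around each field, and the `width` argument (a simple stand-in for a DecimalFormat pattern such as "000") are assumptions.

```python
def split_to_attributes(content, separator=",", prefix="column_", width=3):
    """Split one CSV row into attributes named prefix + zero-padded position."""
    # Assumption: surrounding whitespace is trimmed from every field.
    fields = [field.strip() for field in content.split(separator)]
    # Positions start at 0, matching column_000 ... column_003 in the example.
    return {
        f"{prefix}{index:0{width}d}": value
        for index, value in enumerate(fields)
    }


attributes = split_to_attributes("Peterson, Jenny, New York, USA")
for name, value in attributes.items():
    print(f"{name} = {value}")
```

Every field ends up as an attribute, so the number of attributes grows with the number of columns in the row.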

2.
The MergeTemplate processor for Apache NiFi merges the attributes of a flowfile with an Apache Velocity template. The Velocity template contains placeholders (e.g. $column0, or alternatively in curly brackets: ${column0}).
During the merge, the placeholders in the template are replaced with the attribute values of the flowfile.
See the Apache Velocity website at http://velocity.apache.org for details on the template engine.
A filter (regular expression) has to be specified, defining which attributes are passed to the template engine.
The original flowfile is routed to the "original" relationship; the result of the merge replaces the content of the flowfile, which is routed to the "merged" relationship.
Example:
A flowfile with the following attributes:
column0 = Peterson
column1 = Jenny
column2 = New York
column3 = USA
A template file "names.vm" with the format below. Placeholders start with a dollar sign and are optionally enclosed in curly brackets:
{ "name": "$column0", "first": "$column1", "city": "$column2", "country": "$column3" }
After the attributes are merged with the template, the placeholders in the template are replaced with the values from the flowfile attributes. This is the result:
{ "name": "Peterson", "first": "Jenny", "city": "New York", "country": "USA" }
The processor can be used for any textual data format such as CSV, HTML, XML, JSON, etc.
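To make the merge step concrete, here is a small Python sketch. It does not use Velocity; a regular expression stands in for the template engine and only handles plain $name and ${name} placeholders. The function name, the full-match semantics of the attribute filter, and leaving unknown placeholders untouched are assumptions made for this illustration.

```python
import re

# Matches ${name} (group 1) or $name (group 2).
PLACEHOLDER = re.compile(r"\$(?:\{(\w+)\}|(\w+))")


def merge_template(template, attributes, attribute_filter):
    """Replace $name / ${name} placeholders with matching attribute values."""
    pattern = re.compile(attribute_filter)
    # Only attributes whose names match the filter are visible to the template.
    visible = {k: v for k, v in attributes.items() if pattern.fullmatch(k)}

    def substitute(match):
        name = match.group(1) or match.group(2)
        # In this sketch, placeholders without a visible attribute stay as-is.
        return visible.get(name, match.group(0))

    return PLACEHOLDER.sub(substitute, template)


template = '{ "name": "$column0", "first": "$column1", "city": "$column2", "country": "$column3" }'
attributes = {"column0": "Peterson", "column1": "Jenny", "column2": "New York", "column3": "USA"}
print(merge_template(template, attributes, r"column\d+"))
```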

3.
The ExecuteRuleEngine processor processes rows of data from CSV files. It runs business rules against the data and then updates the flowfile attributes based on the results of the rule engine. One can then route the flowfile based on these attributes.
The processor requires two settings: the rule engine project zip file and a separator. The project zip file can be created with the Business Rules Maintenance Tool - a web application to construct and orchestrate business logic (business rules). Projects can be exported from the web app and used with the ExecuteRuleEngine processor. When the rule engine runs, it splits the incoming row of data (the flowfile content) into individual fields; the separator defines how the fields are separated from each other.
The advantage of using this processor and a rule engine is that the business logic can be defined outside of NiFi. If the logic changes, the flow does NOT have to be changed. The rule engine can be used to define complex business logic, e.g. "Lastname must be XXX and age must be greater than 25 and country must be Germany or France". This would be difficult to model in NiFi and would clutter the flow. Managed in the web app, the business logic can be modified in an agile way, and the flow in NiFi stays clean and lean.
Note that there is a test project file, nifi_test2_dev.zip, which can be used with the test data file allCountries_100.txt. If you use these files, you won't need to install or use the Business Rules Maintenance Tool web application.
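The idea of keeping the rules outside the flow can be illustrated with a short Python sketch. It is not the rule engine and does not read a project zip file: the rule list, the field layout (lastname, age, country), the ";" separator, and the attribute names are all hypothetical and only mirror the example rule quoted above.

```python
import operator

# Hypothetical rule set, standing in for an exported rule engine project.
RULES = [
    ("lastname", operator.eq, "XXX"),
    ("age", lambda actual, limit: int(actual) > limit, 25),
    ("country", lambda actual, allowed: actual in allowed, {"Germany", "France"}),
]
# Hypothetical field layout of the incoming row.
FIELDS = ["lastname", "age", "country"]


def run_rules(row, separator=";", rules=RULES, fields=FIELDS):
    """Split a row into fields, check every rule, return attributes to set."""
    values = dict(zip(fields, (v.strip() for v in row.split(separator))))
    failed = [name for name, check, expected in rules
              if not check(values[name], expected)]
    return {
        "rules.failed.count": str(len(failed)),
        "rules.passed": str(not failed).lower(),
    }


print(run_rules("XXX;30;France"))
print(run_rules("XXX;20;Spain"))
```

Changing the business logic means changing only the rule list; the function that applies it, like the flow, stays the same.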

4.
The GenerateData processor generates random data from word lists, from regular expressions, or purely at random. The output is in CSV format. This is useful if you want to generate mass data that still makes sense (e.g. drawn from word lists). There are also some nice features for generating dependent date columns.
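
As a rough illustration of generating CSV data from word lists with a dependent date column, here is a Python sketch. It is not the processor's implementation: the word lists, the column layout, the date range, and the `seed` argument are assumptions, and generation from regular expressions is not covered.

```python
import csv
import io
import random
from datetime import date, timedelta

# Hypothetical word lists.
FIRST_NAMES = ["Jenny", "Peter", "Anna"]
CITIES = ["New York", "Berlin", "Paris"]


def generate_rows(count, seed=None):
    """Return `count` CSV rows: first name, city, start date, end date."""
    rng = random.Random(seed)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for _ in range(count):
        start = date(2017, 1, 1) + timedelta(days=rng.randrange(365))
        # Dependent date column: the end date is derived from the start date.
        end = start + timedelta(days=rng.randint(1, 30))
        writer.writerow([rng.choice(FIRST_NAMES), rng.choice(CITIES),
                         start.isoformat(), end.isoformat()])
    return buffer.getvalue()


print(generate_rows(3, seed=42), end="")
```

Passing a seed makes the output repeatable, which helps when the generated data is used in tests.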

 

Re: Aw: Re: 4 Processors for NiFi: SplitToAttribute, MergeTemplate, ExecuteRuleEngine, GenerateData

Posted by Joe Witt <jo...@gmail.com>.
No problem.

Docs on the components are there but we will want to build a lot of
solution-oriented docs soon.

On Apr 14, 2017 12:04 PM, "Uwe Geercken" <uw...@web.de> wrote:

>
>
> Joe,
>
> will do so - sorry for that and thanks for the link to the wiki.
>
> I am still in a learning phase. Programming helps me to dig into the
> details and better understand what happens under the hood. I will be
> enhancing my processors over time and will add a note to the documentation
> that points out the memory issue. Actually, I will think about whether
> this can be done differently, without using the flowfile attributes.
>
> About RecordReader/Writer: is there documentation, and maybe a sample
> somewhere, explaining its usage? I will look at QueryFlowFile to see what
> it does.
>
> Rgds,
>
> Uwe

Aw: Re: 4 Processors for NiFi: SplitToAttribute, MergeTemplate, ExecuteRuleEngine, GenerateData

Posted by Uwe Geercken <uw...@web.de>.

Joe,
 
will do so - sorry for that and thanks for the link to the wiki.
 
I am still in a learning phase. Programming helps me to dig into the details and better understand what happens under the hood. I will be enhancing my processors over time and will add a note to the documentation that points out the memory issue. Actually, I will think about whether this can be done differently, without using the flowfile attributes.
 
About RecordReader/Writer: is there documentation, and maybe a sample somewhere, explaining its usage? I will look at QueryFlowFile to see what it does.

Rgds,

Uwe
 

Sent: Friday, 14 April 2017 at 17:27
From: "Joe Witt" <jo...@gmail.com>
To: dev@nifi.apache.org
Subject: Re: 4 Processors for NiFi: SplitToAttribute, MergeTemplate, ExecuteRuleEngine, GenerateData
Uwe,

Please avoid cross posting to both dev and user lists. I recognize
that you'd like to let folks know these exist and that is fine. We
started up this wiki page to help folks just like you let others know
about processors they've built that they're happy to have others use:
https://cwiki.apache.org/confluence/display/NIFI/Community+Contributions
We can give you access to update that if you like.

For the CSV-to-attributes processor, one thing I wanted to mention
is that it would be good to document, if you've not already, the
limitations this will create with regard to memory usage. Such an
approach takes the content of flowfiles and promotes it to attributes
of flowfiles. This will mean that a lot more than is often necessary
will be held in memory. This might be perfectly fine for many use
cases, so it isn't a deal breaker, but it is more of a notice you'd want
to give people. I replied with a bit more context on this to your
more recent email. Something you'll likely want to check out will be
in the NiFi 1.2.0 release: a new RecordReader/RecordWriter
abstraction over flowfile content that people can use in new
processors like QueryFlowFile, which will let you do things like
execute SQL queries over flowfile content that is read in one record
at a time. There are record readers already for CSV, JSON, Avro, and
Grok, with writers for CSV, Avro, JSON, and generic text.

Thanks
Joe


Re: 4 Processors for NiFi: SplitToAttribute, MergeTemplate, ExecuteRuleEngine, GenerateData

Posted by Joe Witt <jo...@gmail.com>.
Uwe,

Please avoid cross posting to both dev and user lists.  I recognize
that you'd like to let folks know these exist and that is fine.  We
started up this wiki page to help folks just like you let others know
about processors they've built that they're happy to have others use:
https://cwiki.apache.org/confluence/display/NIFI/Community+Contributions
 We can give you access to update that if you like.

For the CSV-to-attributes processor, one thing I wanted to mention
is that it would be good to document, if you've not already, the
limitations this will create with regard to memory usage. Such an
approach takes the content of flowfiles and promotes it to attributes
of flowfiles. This will mean that a lot more than is often necessary
will be held in memory. This might be perfectly fine for many use
cases, so it isn't a deal breaker, but it is more of a notice you'd want
to give people. I replied with a bit more context on this to your
more recent email. Something you'll likely want to check out will be
in the NiFi 1.2.0 release: a new RecordReader/RecordWriter
abstraction over flowfile content that people can use in new
processors like QueryFlowFile, which will let you do things like
execute SQL queries over flowfile content that is read in one record
at a time. There are record readers already for CSV, JSON, Avro, and
Grok, with writers for CSV, Avro, JSON, and generic text.
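
To illustrate the memory point, here is a small Python sketch of reading content one record at a time instead of copying every field into attributes. It is not the NiFi RecordReader/RecordWriter API; the function names and the field names are hypothetical, and Python's csv module stands in for a CSV record reader.

```python
import csv
import io


def read_records(stream, fieldnames):
    """Yield one record (a dict) at a time from CSV content."""
    for row in csv.reader(stream):
        yield dict(zip(fieldnames, (value.strip() for value in row)))


def count_matching(stream, fieldnames, predicate):
    # Only the current record is held in memory while filtering.
    return sum(1 for record in read_records(stream, fieldnames)
               if predicate(record))


content = io.StringIO(
    "Peterson, Jenny, New York, USA\n"
    "Meier, Hans, Berlin, Germany\n"
)
fields = ["name", "first", "city", "country"]
print(count_matching(content, fields, lambda r: r["country"] == "USA"))
```

Because the records are produced lazily, the same loop works for content far larger than what would comfortably fit into attributes.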

Thanks
Joe
