Posted to dev@nifi.apache.org by Dave <da...@gmail.com> on 2018/06/18 04:41:07 UTC

SplitText - How to make each split unique?

Hi,

I am learning NiFi. 

I have created an input csv (CityCode.csv) file as below:
ID, CITY_NAME,  ZIP_CD, STATE_CD
1,  Delhi,      110001, DL
2,  Mumbai,     400001, MH
3,  Chennai,    600001, TN
4,  Bangalore,  560001, KA

This is my 1st dataflow. I am building it block by block, and I plan to
create a dataflow like this:
GetFile -> InferAvroSchema -> SplitText -> ConvertCSVToAvro -> ExtractText
    -> on error: put in Kafka
    -> on success: put in DB
I might add a few more functionalities in between to strengthen my knowledge.

InitialFlow.jpg
<http://apache-nifi-developer-list.39713.n7.nabble.com/file/t1006/InitialFlow.jpg>  

I have built the dataflow up to ConvertCSVToAvro. I have a few queries about
the flow so far.

I use the GetFile processor to pick up a CSV file from the directory
D:\ApacheNiFi\source-data. If GetFile succeeds, the flow moves to
"CreateInferAvroSchema".
The InferAvroSchema processor is configured as below:

•	Schema Output Destination - flowfile-attribute
•	Input Content Type - CSV
•	CSV Header Definition -
•	Get CSV Header Definition From Data - true
•	CSV Header Line Skip Count - 1
•	CSV Delimiter - .
•	CSV Escape String - /
•	CSV Quote String - '
•	Pretty Avro Output - true
•	Avro Record Name - CityCode
•	Number of Records To Analyze - 10
•	Charset - UTF8

Scheduling
Scheduling Strategy - Timer Driven, Concurrent Tasks - 1, Run Schedule - 0 sec
Settings
•	I have checked the "original" relationship under Automatically Terminate
Relationships, because I am not able to understand what exactly this
relationship is
•	Failure & Unsupported Content - put the file in the directory
"D:\ApacheNiFi\error-data"
•	Success - SplitText

The reason why I used the SplitText processor before the ConvertCSVToAvro
processor is that the conversion processor cannot capture only the failed
records; it sends on the whole file and adds an "error" attribute to the
failed records. In one specific post, it was recommended to first split the
records and then convert to Avro:
https://stackoverflow.com/questions/41840726/nifi-convertcsvtoavro-how-to-capture-the-failed-records

The SplitText processor is configured as below:
Line Split Count - 1
Header Line Count - 1 (I have kept this as 1 because I have a header in my
file)
Remove Trailing Newlines - true

Splits - flows to the next processor, "ConvertCSVToAvro"
Original - I have created a PutFile processor that stores the file in the
directory "D:\ApacheNiFi\processed-data"
Failure - I am routing it to the same PutFile processor

1st question:
Is it possible to attach some kind of attribute to distinguish every record
that is split? For example, can I attach a unique ID to each record as an
attribute to make it unique? If yes, how can I do that? Are there any
instructions or material available that would help me add such an attribute?
I tried adding an "UpdateAttribute" processor to check if I could achieve
this, but could not find anything related.

2nd question:
I also need to check that the input string in each field of the record is 35
characters. Only then should it follow the "splits" relationship; otherwise
the record should be routed to failure.

Any guidance will be very helpful. I hope I am not sounding very stupid. 

Is there any material for me to practice these kinds of activities, like
validating based on some conditions, or specifying a filename such as
"InvalidRecords.csv" for capturing error records in the folder configured in
the PutFile processor? Everything seems so confusing and I am not able to
find enough material to learn this.

Thanks for your patience and time

Thanks
Dave



--
Sent from: http://apache-nifi-developer-list.39713.n7.nabble.com/

Re: SplitText - How to make each split unique?

Posted by Pierre Villard <pi...@gmail.com>.
Hello,

If your data is only CSV, you might want to look at the ValidateCsv
processor. Using the QueryRecord processor would also give you options to
validate your data against your own constraints.
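
For example, a hypothetical QueryRecord configuration for the 35-character
check from the original question (assuming a CsvReader/writer pair and the
column names from the sample file; the dynamic property name "valid" is made
up and becomes an outbound relationship):

```sql
-- dynamic property "valid" on QueryRecord:
SELECT * FROM FLOWFILE
WHERE CHAR_LENGTH(CITY_NAME) <= 35
  AND CHAR_LENGTH(STATE_CD) <= 35
```

Records matching the query go to the "valid" relationship; a second property
with the inverted condition could capture the invalid records.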

Pierre



Re: SplitText - How to make each split unique?

Posted by Dave <da...@gmail.com>.
Thanks.. I will surely take some time to go through the suggested topics and
then proceed.




Re: SplitText - How to make each split unique?

Posted by Andy LoPresto <al...@apache.org>.
Dave,

It sounds like many of the issues you are having with specific processors can be answered with a combination of mailing list questions and the processor documentation. I would also encourage you to read the Getting Started Guide [1], User Guide [2], and Admin Guide [3]. They are long, but really offer a good overview of the entire system. Once you understand the framework (which can be complicated), a lot of the little pieces which can seem weird start to fall into place.

There are also a number of good independent blogs (Bryan Bende [4], Pierre Villard [5], Matt Burgess [6]) which give step-by-step instructions on various topics, and some excellent overview videos on YouTube by Jenn Barnabee [7].

[1] https://nifi.apache.org/docs/nifi-docs/html/getting-started.html
[2] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html
[3] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html
[4] https://bryanbende.com/
[5] https://pierrevillard.com/
[6] https://funnifi.blogspot.com/
[7] https://www.youtube.com/playlist?list=PLHre9pIBAgc4e-tiq9OIXkWJX8bVXuqlG


Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69



Re: SplitText - How to make each split unique?

Posted by Dave <da...@gmail.com>.
Thanks.. I will try to implement this suggestion also.

Actually since I earlier have managed a Datawarehouse project, I am trying
to pick up scenarios based on that experience. I have actually visualised
this specific scenario.

The source file will be a CSV file, and all the fields will be stored as
strings in the CSV file (similar to the source files in a DWH project). Once
the fields are validated for their length, the data will be stored in a DB
table with a unique ID to keep track of failures at a later stage. My plan
is to store the data in the appropriate format at this point with the unique
ID, e.g. ID and Zipcode as numeric.

Once I am able to achieve the above, I would like to add a few more fields
later, like a date; some descriptions of the city will also be added. The
description will also have invalid characters. After this I would like to
implement another set of validations, like date format, removing invalid
characters etc., and then store in a different DB.

At every stage I want to also capture the error records and either publish
them or store in a different DB. 

This is the goal for now.

Thanks
Dave




Re: SplitText - How to make each split unique?

Posted by Bryan Bende <bb...@gmail.com>.
Hello,

In general you probably want to take a look at the "record" processors
which will offer a more efficient way of performing this task without
needing to split to 1 message per flow file.

The flow with the record processors would probably be GetFile ->
ConvertRecord (using CsvReader and AvroWriter) -> PublishKafkaRecord

Regarding your specific questions...

1) All split processors write a standard set of "fragment" attributes,
which you can read about in the processor's documentation. The
fragment.identifier will be a unique id for the overall flow file, and
fragment.index will be the index of the split within the given
fragment.identifier.
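
A rough Python sketch of that bookkeeping (this is only an illustration of
the attribute scheme, not NiFi code; the function name and sample lines are
made up):

```python
import uuid

def split_text(lines, header_count=1):
    """Roughly mimic how a split processor labels its splits: every split
    from one parent flow file shares the same fragment.identifier, and
    fragment.index numbers the splits. The pair is unique per record."""
    header, records = lines[:header_count], lines[header_count:]
    fragment_id = str(uuid.uuid4())  # one id for the whole parent file
    splits = []
    for index, record in enumerate(records):
        splits.append({
            "content": "\n".join(header + [record]),
            "attributes": {
                "fragment.identifier": fragment_id,
                "fragment.index": str(index),
                "fragment.count": str(len(records)),
            },
        })
    return splits

lines = ["ID,CITY_NAME,ZIP_CD,STATE_CD",
         "1,Delhi,110001,DL",
         "2,Mumbai,400001,MH"]
for s in split_text(lines):
    # a per-record key, e.g. via ${fragment.identifier}-${fragment.index}
    attrs = s["attributes"]
    print(attrs["fragment.identifier"] + "-" + attrs["fragment.index"])
```

In the flow itself you could expose that key with UpdateAttribute using the
expression ${fragment.identifier}-${fragment.index}.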

2) I think you will need to write a custom script or processor for
this validation part. I suppose there could be a generic
ValidateFieldLength processor, but it doesn't seem like a common case,
and it only applies to fields that are strings, which is a small
subset of the possible types.
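
The check itself is simple; a minimal sketch in plain Python (the function
name and the 35-character limit from the question are the only inputs here,
and the routing to success/failure would be done by whatever scripting
processor hosts it, which is not shown):

```python
MAX_FIELD_LENGTH = 35  # the limit from the question

def validate_record(line, delimiter=","):
    """Return (is_valid, offending_fields) for one CSV record."""
    bad = [field.strip() for field in line.split(delimiter)
           if len(field.strip()) > MAX_FIELD_LENGTH]
    return (len(bad) == 0, bad)

ok, bad = validate_record("1,Delhi,110001,DL")
print(ok)  # True

ok, bad = validate_record("2," + "X" * 40 + ",400001,MH")
print(ok)  # False, with the 40-character field in bad
```

Inside e.g. an ExecuteScript processor, the boolean result would decide
whether the flow file is transferred to the success or failure relationship.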

-Bryan




Re: SplitText - How to make each split unique?

Posted by Dave <da...@gmail.com>.
Thank you.. I will try this and the other solution offered.

Thanks


