Posted to users@nifi.apache.org by Venkat Williams <ve...@gmail.com> on 2017/06/06 07:12:04 UTC

Create a ConvertCSVToJSON Processor

Hi



I want to contribute this processor implementation code to the NiFi project.



*Requirements:*



1)     Convert CSV files to a standard/canonical JSON format

a.       One JSON object/document per row in the input CSV

b.      Format should encode the data as JSON fields and values

c.       JSON Field names should be the original column header with any invalid
characters handled properly.

d.      Values should be kept unaltered

2)     Optionally, be able to specify an expected header used to
validate/reject input CSVs

3)     Support both tab and comma delimited files

a.     Auto-detect based on header row is easy

b.    Allow operator to specify the delimiter as a way to override the
auto-detect logic

4)     Handle arbitrarily large files...

a.       Should handle CSV files of any length (achieve this using batching)

5)     Handle errors gracefully

a.       File failures

b.      Row failures

6)     Support RFC-4180 <https://tools.ietf.org/html/rfc4180> formatted CSV
files, and be sure to handle edge cases like embedded newlines in a field
value and escaped double quotes



Example:

Input CSV:

user,source_ip,source_country,destination_ip,url,timestamp

Venkat,192.168.0.1,IN,23.246.97.82,http://www.google.com,2017-02-22T14:46:24-05:00



Desired output JSON:

{"user":"Venkat","source_ip":"192.168.0.1","source_country":"IN","destination_ip":"23.246.97.82","url":"
http://www.google.com","timestamp":"2017-02-22T14:46:24-05:00"}
<http://www.google.com/>



*Implementation:*

1)      Reviewed the existing CSV libraries that can be used to transform a
CSV record into a JSON document while supporting the RFC-4180
<https://tools.ietf.org/html/rfc4180> standard, including embedded newlines
in field values and escaped quotes. Found that OpenCSV, FastCSV, and the
univocity CSV library can do this job most effectively.

2)      Selected the univocity CSV library because it alone lets me do most
of the validations required by my requirements. In performance testing with
arbitrarily large 5 GB and 10 GB files, it gave better results than any of
the others (see the sketch after this list).

3)      Processed CSV records are emitted immediately rather than waiting
for the complete file to be processed; a configurable count in the processor
controls how many records to accumulate before emitting. With this approach
I could process 5 GB of CSV data using 1 GB of NiFi RAM, which is the most
attractive feature of this implementation for handling large files. (This is
a common limitation of processors like SplitText and SplitXML, which wait
for the whole file to be processed and store the resulting FlowFiles in an
ArrayList inside the processor, causing heap/out-of-memory issues.)

4)      Handled file errors and record errors gracefully using user-defined
configurations and processor routes.
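
To make points 2 and 3 concrete, here is a minimal standalone sketch of the
approach (not the actual processor code; the class and method names are
illustrative, and it assumes univocity-parsers and Jackson are available):
the CSV is streamed with header extraction enabled, each row becomes one
JSON object, and rows are handed off in configurable batches instead of
buffering the whole file.

import com.fasterxml.jackson.databind.ObjectMapper;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

public class CsvToJsonSketch {

    /**
     * Streams CSV from the given reader and hands off batches of JSON lines
     * (one object per row) to emitBatch, so the whole file is never buffered.
     */
    public static void convert(Reader csv, char delimiter, int batchSize,
                               Consumer<String> emitBatch) throws Exception {
        CsvParserSettings settings = new CsvParserSettings();
        settings.setHeaderExtractionEnabled(true);    // first row supplies the JSON field names
        settings.getFormat().setDelimiter(delimiter); // ',' or '\t'; operator override of auto-detection
        settings.setLineSeparatorDetectionEnabled(true);
        settings.setNullValue("");                    // keep missing values as empty strings

        CsvParser parser = new CsvParser(settings);   // handles quoted fields, embedded newlines, "" escapes
        parser.beginParsing(csv);

        ObjectMapper mapper = new ObjectMapper();
        StringBuilder batch = new StringBuilder();
        int rowsInBatch = 0;

        String[] row;
        while ((row = parser.parseNext()) != null) {
            String[] headers = parser.getContext().headers();
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < headers.length && i < row.length; i++) {
                // requirement 1c: sanitize invalid characters in the field name
                String name = headers[i].trim().replaceAll("[^A-Za-z0-9_]", "_");
                record.put(name, row[i]);             // values themselves are kept unaltered
            }
            batch.append(mapper.writeValueAsString(record)).append('\n');

            if (++rowsInBatch >= batchSize) {         // emit early instead of holding the whole file
                emitBatch.accept(batch.toString());
                batch.setLength(0);
                rowsInBatch = 0;
            }
        }
        if (rowsInBatch > 0) {
            emitBatch.accept(batch.toString());
        }
    }
}

In the actual processor, the emitBatch step would write one FlowFile per
batch and transfer it downstream, which is what keeps heap usage flat
regardless of input size.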

Can anyone suggest how to proceed: should I open a new issue, or should I
use an existing one? (I couldn't find any that matches this requirement.)

Re: Create a ConvertCSVToJSON Processor

Posted by Mark Payne <ma...@hotmail.com>.
Venkat,

I actually started work on NIFI-3921 today. It touches a lot of things because so much is now built upon the record readers and writers, so I just need to be very diligent in my testing to ensure that I don't break anything. Hopefully I will have a PR up later this week.

Thanks
-Mark


Re: Create a ConvertCSVToJSON Processor

Posted by Venkat Williams <ve...@gmail.com>.
Thanks Mark for the valuable inputs.

SplitRecord is the way to handle multiline records, and NIFI-3921 helps us
avoid needing a schema when we can use the CSV header row itself as the schema.

Is anyone working on the NIFI-3921 issue? If not, I can take it up.

Regards,
Venkat


Re: Create a ConvertCSVToJSON Processor

Posted by Venkat Williams <ve...@gmail.com>.
Thanks Matt, that sounds like a good intermediate solution for now.

Regards,
Venkat



Re: Create a ConvertCSVToJSON Processor

Posted by Matt Burgess <ma...@gmail.com>.
Venkat,

In the meantime, I have a Groovy script for ExecuteScript [1] that will read the header and create an Avro schema (stored in the avro.schema attribute) so you can set the access strategy to Use Schema Text. It works like the Use Header Fields strategy in CSVReader, meaning all fields are assumed to be strings.

Regards,
Matt

[1] https://gist.github.com/mattyb149/6c9ac2d0961b8ff38ad716646f45b073
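
For anyone who cannot open the gist, the transformation it performs is
roughly the following (a plain Java sketch of the same idea, not the actual
Groovy script; the class and method names here are illustrative): each
header column becomes a nullable string field of an Avro record schema, and
the resulting JSON text is what would go into the avro.schema attribute.

import java.util.ArrayList;
import java.util.List;

public class HeaderToAvroSchemaSketch {

    /**
     * Builds an Avro record schema (as JSON text) from a CSV header line,
     * treating every column as a nullable string.
     */
    public static String avroSchemaFromHeader(String headerLine, String delimiter) {
        List<String> fields = new ArrayList<>();
        for (String column : headerLine.split(delimiter)) {
            // normalize the column name into a legal Avro field name
            String name = column.trim().replaceAll("[^A-Za-z0-9_]", "_");
            fields.add("{\"name\":\"" + name + "\",\"type\":[\"null\",\"string\"]}");
        }
        return "{\"type\":\"record\",\"name\":\"csv_record\",\"fields\":["
                + String.join(",", fields) + "]}";
    }

    public static void main(String[] args) {
        // header from the example in the original post
        System.out.println(avroSchemaFromHeader(
                "user,source_ip,source_country,destination_ip,url,timestamp", ","));
    }
}

Running main with the header from the original post prints a record schema
whose six fields (user through timestamp) are all strings.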


Re: Create a ConvertCSVToJSON Processor

Posted by Mark Payne <ma...@hotmail.com>.
Venkat,

If you do need to split the data up, there is now a SplitRecord processor that you can use to accomplish that with the readers and writers.
So that won't have problems with CSV fields that span multiple lines.

Unfortunately, at this time the writer does require that a schema registry be used to designate the schema. For most cases this is fairly
easy to do, but it is a step that we should be able to skip altogether. There already exists a JIRA [1] to update the readers/writers so that
the Record Writer can just inherit the schema that is provided by the Record Reader. Once this has been done, the CSV Reader should
be able to create the schema based on the CSV header, and then pass that along to the record writer.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-3921



Re: Create a ConvertCSVToJSON Processor

Posted by Venkat Williams <ve...@gmail.com>.
Hi Joe and Mark,

Thanks a lot for your prompt response.

I wasn't able to consider SplitText because CSV record field values can
spill onto the next line, with embedded newlines, escaped double-quotes,
etc., so I had to rule out any logic based on splitting by line.

Another question: is it possible to convert CSV data to JSON without
specifying any schema, just by treating the first row of the CSV file as the
header and building the schema internally from it? If I don't specify a
schema registry, I get a validation error saying the 'schema access
strategy' is invalid.

Thanks,
Venkat

On Tue, Jun 6, 2017 at 9:29 PM, Joe Witt <jo...@gmail.com> wrote:

> Venkat,
>
> The only heap issues that could be considered common are if you're doing
> 'SplitText' and trying to go from files with hundreds of thousands or
> millions of lines to single-line output in a single processor.  You can
> easily overcome that by doing a two-phase split, where the first
> processor cuts into, say, 1000-line chunks and the next one does single-
> line chunks.  That said, with this record approach it doesn't have
> that problem at all, so the only cause for memory issues there would be
> if any single record is so large that it takes up all the memory, which
> doesn't appear likely for your examples.
>
> Thanks
>
> On Tue, Jun 6, 2017 at 11:49 AM, Venkat Williams
> <ve...@gmail.com> wrote:
> > Thanks Mark for helping me to build a template and test Convert CSV to
> > JSON processing.
> >
> > I want to know: is it possible to emit transformed records to the next
> > processor as they are produced, rather than waiting for the full file to
> > be processed and keeping the entire result in a single FlowFile?
> >
> > Input:
> > id,topic,hits
> > Rahul,scala,120
> > Nikita,spark,80
> > Mithun,spark,1
> > myself,cca175,180
> >
> > Actual Output:
> > [{"id":"Rahul","topic":"scala","hits":120},{"id":"Nikita","
> topic":"spark","hits":80},{"id":"Mithun","topic":"spark","
> hits":1},{"id":"myself","topic":"cca175","hits":180}]
> >
> > Expected output:(multiple flow files like split result)
> > {"id":"Rahul","topic":"scala","hits":120}
> > {"id":"Nikita","topic":"spark","hits":80}
> > {"id":"Mithun","topic":"spark","hits":1}
> > {"id":"myself","topic":"cca175","hits":180}
> >
> > By doing this I can overcome heap/out-of-memory issues, which are so common
> > (scenario: NiFi limited to 1 GB RAM, wanting to process 5 GB of input data).
> >
> > Regards,
> > Venkat
> >
> > On Tue, Jun 6, 2017 at 8:32 PM, Mark Payne <ma...@hotmail.com> wrote:
> >>
> >> Hi Venkat,
> >>
> >> I just published a blog post [1] on running SQL in NiFi. The post walks
> >> through creating a CSV Record Reader,
> >> running SQL over the data, and then writing the results in JSON. This
> >> may be helpful to you. In your case,
> >> you may want to just use the ConvertRecord processor, rather than
> >> QueryRecord, but the concepts of creating
> >> the Record Reader and Writer are the same. This post references another
> >> post [2] that I wrote a week or two ago
> >> that gives a bit more details on how to actually create the reader and
> >> writer.
> >>
> >> The CSV Reader uses Apache Commons CSV, so it will support RFC-4180,
> >> embedded newlines, escaped
> >> double-quotes, etc.
> >>
> >> I hope this helps give some direction in how to handle this in NiFi.
> >>
> >> Thanks
> >> -Mark
> >>
> >> [1] https://blogs.apache.org/nifi/entry/real-time-sql-on-event
> >> [2] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
> >>
> >>
> >> On Jun 6, 2017, at 9:52 AM, Venkat Williams <ve...@gmail.com>
> >> wrote:
> >>
> >> Hi Joe Witt,
> >>
> >> Thanks for your response.
> >>
> >> I have heard and read about these record readers but haven't quite
> >> understood how to use them with some test data or a template. It would
> >> be great if you could help me get a working example or flow.
> >>
> >> I want to know whether these implementations support RFC-4180 formatted
> >> CSV files and handle edge cases like embedded newlines in a field
> >> value and escaped double quotes.
> >>
> >> Thanks in advance for your help.
> >>
> >> Regards,
> >> Venkat
> >>
> >> On Tue, Jun 6, 2017 at 7:07 PM, Joe Witt <jo...@gmail.com> wrote:
> >>>
> >>> Venkat
> >>>
> >>> I think you'll want to take a closer look at the apache nifi 1.2.0
> >>> release support for record readers and record writers.  It handles
> >>> schema aware parsing/transformation and more for things like csv,
> >>> json, avro, can be easily extended, and supports scripted readers and
> >>> writers written right there through the UI.  As it is new examples are
> >>> still emerging but we can certainly help you along.
> >>>
> >>> Thanks
> >>> Joe
> >>>
> >>> On Tue, Jun 6, 2017 at 3:12 AM, Venkat Williams
> >>> <ve...@gmail.com> wrote:
> >>> > Hi
> >>> >
> >>> >
> >>> >
> >>> > I want to contribute this processor implementation code to NIFI
> >>> > project.
> >>> >
> >>> >
> >>> >
> >>> > Requirements:
> >>> >
> >>> >
> >>> >
> >>> > 1)     Convert CSV files to a standard/canonical JSON format
> >>> >
> >>> > a.       One JSON object/document per row in the input CSV
> >>> >
> >>> > b.      Format should encode the data as JSON fields and values
> >>> >
> >>> > c.       JSON Field names should be the original column header with
> any
> >>> > invalid characters handled properly.
> >>> >
> >>> > d.      Values should be kept unaltered
> >>> >
> >>> > 2)     Optionally, be able to specify an expected header used to
> >>> > validate/reject input CSVs
> >>> >
> >>> > 3)     Support both tab and comma delimited files
> >>> >
> >>> > a.     Auto-detect based on header row is easy
> >>> >
> >>> > b.    Allow operator to specify the delimiter as a way to override
> the
> >>> > auto-detect logic
> >>> >
> >>> > 4)     Handle arbitrarily large files...
> >>> >
> >>> > a.       should handle CSV files of any length ( achieve this using
> >>> > batching)
> >>> >
> >>> > 5)     Handle errors gracefully
> >>> >
> >>> > a.       File failures
> >>> >
> >>> > b.      Row failures
> >>> >
> >>> > 6)     Support for RFC-4180 formatted CSV files and be sure to handle
> >>> > edge
> >>> > cases like embedded newlines in a field value and escaped double
> quotes
> >>> >
> >>> >
> >>> >
> >>> > Example:
> >>> >
> >>> > Input CSV:
> >>> >
> >>> > user,source_ip,source_country,destination_ip,url,timestamp
> >>> >
> >>> >
> >>> > Venkat,192.168.0.1,IN,23.246.97.82,http://www.google.com,
> 2017-02-22T14:46:24-05:00
> >>> >
> >>> >
> >>> >
> >>> > Desired output JSON:
> >>> >
> >>> >
> >>> > {"user":"Venkat","source_ip":"192.168.0.1","source_country":
> "IN","destination_ip":"23.246.97.82","url":"http://www.google.com
> ","timestamp":"2017-02-22T14:46:24-05:00"}
> >>> >
> >>> >
> >>> >
> >>> > Implementation:
> >>> >
> >>> > 1)      Reviewed all the existing csv libraries which can be used to
> >>> > transform csv record to json document by supporting  RFC-4180
> standard
> >>> > to
> >>> > handle embedded new lines in field value and escaped quotes. Found
> >>> > OpenCSV,
> >>> > FastCSV, UnivocityCSV Libraries can do this job most effectively.
> >>> >
> >>> > 2)      Selected Univocity CSV Library as I can do most of
> validations
> >>> > which
> >>> > are part of my requirements only using this library. When I did the
> >>> > performance testing using 5 GB and 10GB arbitrarily large files this
> >>> > gave
> >>> > better results compared any others.
> >>> >
> >>> > 3)      Processed CSV Records are being emitted immediately rather
> than
> >>> > waiting complete file processing. Used some configurable number in
> >>> > processor
> >>> > to wait until that many records to emit. With this approach I could
> >>> > process
> >>> > 5GB CSV data records using 1GB NIFI RAM which is most effective /
> >>> > attractive
> >>> > feature in this whole implementation to handle large files. ( This is
> >>> > common
> >>> > limitation in most of processors like SplitText, SplitXML, etc wait
> >>> > until
> >>> > whole file processing and stores the results FlowFile ArrayList
> within
> >>> > the
> >>> > processor this cause heap size/outofmemory issues)
> >>> >
> >>> > 4) Handled File errors and record errors gracefully using user
> defined
> >>> > configurations and processor routes.
> >>> >
> >>> > Can anyone suggest how to proceed further whether I have to open new
> >>> > issue
> >>> > or if I have to use any existing issue. ( I don't find any which
> >>> > matches to
> >>> > this requirement)
> >>
> >>
> >>
> >
>

Re: Create a CovertCSVToJSON Processor

Posted by Joe Witt <jo...@gmail.com>.
Venkat,

The only heap issues that could be considered common occur if you're using
'SplitText' to go from a file with hundreds of thousands or millions of
lines to single-line outputs in a single processor.  You can easily
overcome that with a two-phase split, where the first processor cuts the
file into, say, 1000-line chunks and the next one splits those chunks into
single lines.  That said, the record approach doesn't have that problem at
all, so the only cause for memory issues there would be a single record so
large that it takes up all the memory, which doesn't appear likely for
your examples.
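
As a rough standalone illustration of the first phase of that two-phase
idea (plain Java, not NiFi's actual SplitText implementation, and the file
names are just placeholders), something like this cuts a huge file into
1000-line chunks while never holding more than one line in memory:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TwoPhaseSplitSketch {
    public static void main(String[] args) throws Exception {
        final int linesPerChunk = 1000;
        int chunkIndex = 0;
        int linesInChunk = 0;
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get("huge-input.txt"), StandardCharsets.UTF_8)) {
            BufferedWriter out = newChunk(chunkIndex);
            String line;
            while ((line = in.readLine()) != null) {
                if (linesInChunk == linesPerChunk) {
                    out.close();                   // finish the current chunk
                    out = newChunk(++chunkIndex);  // start the next one
                    linesInChunk = 0;
                }
                out.write(line);
                out.newLine();
                linesInChunk++;
            }
            out.close();
        }
    }

    // Each chunk becomes its own file; a second pass could then split each
    // chunk into single lines without ever loading the whole input at once.
    private static BufferedWriter newChunk(int index) throws Exception {
        return Files.newBufferedWriter(
                Paths.get(String.format("chunk-%05d.txt", index)),
                StandardCharsets.UTF_8);
    }
}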

Thanks

Re: Create a CovertCSVToJSON Processor

Posted by Venkat Williams <ve...@gmail.com>.
Thanks, Mark, for helping me build a template and test the Convert CSV to
JSON processing.

I want to know whether it is possible to emit transformed records to the
next processor as they are produced, rather than waiting for the full file
to be processed and keeping the entire result in a single FlowFile.

Input:
id,topic,hits
Rahul,scala,120
Nikita,spark,80
Mithun,spark,1
myself,cca175,180

Actual Output:
[{"id":"Rahul","topic":"scala","hits":120},{"id":"Nikita","topic":"spark","hits":80},{"id":"Mithun","topic":"spark","hits":1},{"id":"myself","topic":"cca175","hits":180}]

Expected output (multiple FlowFiles, like a split result):
{"id":"Rahul","topic":"scala","hits":120}
{"id":"Nikita","topic":"spark","hits":80}
{"id":"Mithun","topic":"spark","hits":1}
{"id":"myself","topic":"cca175","hits":180}

By doing this I can overcome the heap/OutOfMemory issues, which are so
common (scenario: NiFi limited to 1 GB of RAM while needing to process
5 GB of input data).
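
Outside of NiFi, the record-at-a-time behaviour I have in mind looks
roughly like this sketch (just an illustration using Apache Commons CSV
and Jackson; the file names are placeholders, and values stay as strings
rather than numbers):

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CsvToJsonLinesSketch {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        try (CSVParser parser = CSVFormat.RFC4180.withFirstRecordAsHeader().parse(
                 Files.newBufferedReader(Paths.get("input.csv"), StandardCharsets.UTF_8));
             BufferedWriter out = Files.newBufferedWriter(
                 Paths.get("output.jsonl"), StandardCharsets.UTF_8)) {
            // The header row supplies the JSON field names; each CSV record is
            // written out as its own JSON line as soon as it is read, so memory
            // use is bounded by one record rather than the whole file.
            for (CSVRecord record : parser) {
                out.write(mapper.writeValueAsString(record.toMap()));
                out.newLine();
            }
        }
    }
}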

Regards,
Venkat

Re: Create a CovertCSVToJSON Processor

Posted by Mark Payne <ma...@hotmail.com>.
Hi Venkat,

I just published a blog post [1] on running SQL in NiFi. The post walks through creating a CSV Record Reader,
running SQL over the data, and then writing the results in JSON. This may be helpful to you. In your case,
you may want to just use the ConvertRecord processor, rather than QueryRecord, but the concepts of creating
the Record Reader and Writer are the same. This post references another post [2] that I wrote a week or two ago
that gives a bit more detail on how to actually create the reader and writer.

The CSV Reader uses Apache Commons CSV, so it will support RFC-4180, embedded newlines, escaped
double-quotes, etc.
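
For example, a quick standalone snippet (hypothetical field names, just to
illustrate the parsing behaviour) shows the RFC4180 format keeping an
embedded newline and unescaping doubled quotes inside a quoted field:

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

import java.io.StringReader;

public class Rfc4180Demo {
    public static void main(String[] args) throws Exception {
        // One record whose "comment" field contains an embedded newline
        // and an escaped double quote ("" inside a quoted field).
        String csv = "user,comment\r\n"
                   + "venkat,\"line one\nline \"\"two\"\"\"\r\n";
        try (CSVParser parser = CSVFormat.RFC4180.withFirstRecordAsHeader()
                                                 .parse(new StringReader(csv))) {
            for (CSVRecord record : parser) {
                // Prints both lines, with the doubled quotes unescaped.
                System.out.println(record.get("comment"));
            }
        }
    }
}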

I hope this helps give you some direction on how to handle this in NiFi.

Thanks
-Mark

[1] https://blogs.apache.org/nifi/entry/real-time-sql-on-event
[2] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi


Re: Create a CovertCSVToJSON Processor

Posted by Venkat Williams <ve...@gmail.com>.
Hi Joe Witt,

Thanks for your response.

I have heard and read about these record readers but have not quite worked
out how to use them with some test data or a template. It would be great if
you could help me get a working example or flow.

I want to know whether these implementations support RFC-4180
<https://tools.ietf.org/html/rfc4180> formatted CSV files and handle edge
cases like embedded newlines in a field value and escaped double quotes.

Thanks in advance for your help.

Regards,
Venkat

Re: Create a CovertCSVToJSON Processor

Posted by Joe Witt <jo...@gmail.com>.
Venkat,

I think you'll want to take a closer look at the Apache NiFi 1.2.0
release's support for record readers and record writers.  It handles
schema-aware parsing/transformation and more for formats like CSV, JSON,
and Avro, can be easily extended, and supports scripted readers and
writers written right there through the UI.  As it is new, examples are
still emerging, but we can certainly help you along.

Thanks
Joe
