Posted to users@nifi.apache.org by Boris Tyukin <bo...@boristyukin.com> on 2018/08/07 12:31:07 UTC

AVRO is the only output format with ExecuteSQL

I've been wondering since I started learning NiFi why the ExecuteSQL
processor only returns AVRO-formatted data. All the community examples I've
seen then convert the AVRO to JSON, and pretty much all of them then split
the JSON into multiple flowfiles.

I found myself doing the same thing over and over and over again.

Since everyone is doing it, is there a strong reason why AVRO is liked so
much? And why does everyone continue with this three-step pattern rather
than giving users an option to output JSON instead, and another option to
output one flowfile or multiple (one per record)?

thanks
Boris

Re: AVRO is the only output format with ExecuteSQL

Posted by Mike Thomsen <mi...@gmail.com>.
Boris,

Yeah, you can fork either Matt's branch or his entire repo and try it out.
Also, the usual caveat: user beware until it passes code review...

Mike

On Mon, Aug 13, 2018 at 8:36 AM Boris Tyukin <bo...@boristyukin.com> wrote:

> Matt, you are awesome! 15 files changed and 3k lines of code - man, do not
> tell me you did that in just a few days :)
>
> since it has not been merged into master yet, can I just use your
> personal branch to compile all of NiFi? or is it better to cherry-pick your
> commit into master? I would like to try it out
>
> Boris
>
> On Fri, Aug 10, 2018 at 4:55 PM Matt Burgess <ma...@apache.org> wrote:
>
>> Boris et al,
>>
>> I put up a PR [1] to add ExecuteSQLRecord and QueryDatabaseTableRecord
>> under NIFI-4517, in case anyone wants to play around with it :)
>>
>> Regards,
>> Matt
>>
>> [1] https://github.com/apache/nifi/pull/2945
>> On Tue, Aug 7, 2018 at 8:30 PM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>> >
>> > Matt, you rock!! thank you!!
>> >
>> > On Tue, Aug 7, 2018 at 5:16 PM Matt Burgess <ma...@gmail.com>
>> wrote:
>> >>
>> >> Sounds good, it makes the underlying code a bit more complicated but I
>> see from y’all’s points that a “separate” processor is a better user
>> experience. I’m knee-deep in it as we speak, and hope to have a PR up in a few
>> days.
>> >>
>> >> Thanks,
>> >> Matt
>> >>
>> >>
>> >> On Aug 7, 2018, at 5:07 PM, Andrew Grande <ap...@gmail.com> wrote:
>> >>
>> >> I'd really like to see the Record suffix on the processor for
>> discoverability, as already mentioned.
>> >>
>> >> Andrew
>> >>
>> >> On Tue, Aug 7, 2018, 2:16 PM Matt Burgess <ma...@apache.org>
>> wrote:
>> >>>
>> >>> Yeah that's definitely doable, most of the logic for writing a
>> >>> ResultSet to a Flow File is localized (currently to JdbcCommon but
>> >>> also in ResultSetRecordSet), so I wouldn't think it would be too much
>> >>> of a refactor. What are folks' thoughts on whether to add a Record Writer
>> >>> property to the existing ExecuteSQL or subclass it to a new processor
>> >>> called ExecuteSQLRecord? The former is more consistent with how the
>> >>> SiteToSite reporting tasks work, but this is a processor. The latter
>> >>> is more consistent with the way we've done other record processors,
>> >>> and the benefit there is that we don't have to add a bunch of
>> >>> documentation to fields that will be ignored (such as the Use Avro
>> >>> Logical Types property, which we wouldn't need in an ExecuteSQLRecord).
>> >>> Having said that, we will want to offer the same options in the Avro
>> >>> Reader/Writer, but Peter is working on that under NIFI-5405 [1].
>> >>>
>> >>> Thanks,
>> >>> Matt
>> >>>
>> >>> [1] https://issues.apache.org/jira/browse/NIFI-5405
>> >>>
>> >>> On Tue, Aug 7, 2018 at 2:06 PM Andy LoPresto <al...@apache.org>
>> wrote:
>> >>> >
>> >>> > Matt,
>> >>> >
>> >>> > Would extending the core ExecuteSQL processor with an
>> ExecuteSQLRecord processor also work? I wonder about discoverability if
>> only one processor is present and in other places we explicitly name the
>> processors which handle records as such. If the ExecuteSQL processor
>> handled all the SQL logic, and the ExecuteSQLRecord processor just
>> delegated most of the processing in its #onTrigger() method to super, do
>> you foresee any substantial difficulties? It might require some refactoring
>> of the parent #onTrigger() into service methods.
>> >>> >
>> >>> >
>> >>> > Andy LoPresto
>> >>> > alopresto@apache.org
>> >>> > alopresto.apache@gmail.com
>> >>> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> >>> >
>> >>> > On Aug 7, 2018, at 10:25 AM, Andrew Grande <ap...@gmail.com>
>> wrote:
>> >>> >
>> >>> > As a side note, one has to have a serious justification _not_ to
>> use record-based processors. The benefits, including performance, are too
>> numerous to call out here.
>> >>> >
>> >>> > Andrew
>> >>> >
>> >>> > On Tue, Aug 7, 2018, 1:15 PM Mark Payne <ma...@hotmail.com>
>> wrote:
>> >>> >>
>> >>> >> Boris,
>> >>> >>
>> >>> >> Using a Record-based processor does not mean that you need to
>> define a schema upfront. This is
>> >>> >> necessary if the source itself cannot provide a schema. However,
>> since it is pulling structured data
>> >>> >> and the schema can be inferred from the database, you wouldn't
>> need to. As Matt was saying, your
>> >>> >> Record Writer can simply be configured to Inherit Record Schema.
>> It can then write the schema to
>> >>> >> the "avro.schema" attribute or you can choose "Do Not Write
>> Schema". This would still allow the data
>> >>> >> to be written in JSON, CSV, etc.
>> >>> >>
>> >>> >> You could also have the Record Writer choose to write the schema
>> using the "avro.schema" attribute,
>> >>> >> as mentioned above, and then have any downstream processors read
>> the schema from this attribute.
>> >>> >> This would allow you to use any record-oriented processors you'd
>> like without having to define the
>> >>> >> schema yourself, if you don't want to.
>> >>> >>
>> >>> >> Thanks
>> >>> >> -Mark
>> >>> >>
>> >>> >>
>> >>> >>
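For reference, a minimal configuration sketch of the pattern Mark describes,
assuming NiFi's standard JsonRecordSetWriter and JsonTreeReader services
(property names as in NiFi 1.x, meant as an illustration rather than an
exact transcript of the UI):

  JsonRecordSetWriter (the Record Writer):
    Schema Access Strategy : Inherit Record Schema
    Schema Write Strategy  : Set 'avro.schema' Attribute   (or: Do Not Write Schema)

  JsonTreeReader (on any downstream record processor):
    Schema Access Strategy : Use 'Schema Text' Property
    Schema Text            : ${avro.schema}   (picks up the schema written upstream)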
>> >>> >> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>> >>> >>
>> >>> >> thanks for all the responses! it means I am not the only one
>> interested in this topic.
>> >>> >>
>> >>> >> Record-aware version would be really nice, but a lot of times I do
>> not want to use record-based processors since I need to define a schema for
>> input/output upfront and just want to run a SQL query and get whatever
>> results back. It just adds an extra step that can break and has to be
>> supported.
>> >>> >>
>> >>> >> Similar to Kafka processors, it is nice to have an option of
>> a record-based processor vs. a message-oriented processor. But if one processor
>> can do it all, it is even better :)
>> >>> >>
>> >>> >>
>> >>> >> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <ma...@apache.org>
>> wrote:
>> >>> >>>
>> >>> >>> I'm definitely interested in supporting a record-aware version as
>> well
>> >>> >>> (I wrote the Jira up last year [1] but haven't gotten around to
>> >>> >>> implementing it), however I agree with Peter's comment on the
>> Jira.
>> >>> >>> Since ExecuteSQL is an oft-touched processor, if we had two
>> processors
>> >>> >>> that only differed in how the output is formatted, it could be
>> harder
>> >>> >>> to maintain (e.g., bugs to be fixed in two places). I think we
>> should
>> >>> >>> add an optional RecordWriter property to ExecuteSQL, and the
>> >>> >>> documentation would reflect that if it is not set, the output
>> will be
>> >>> >>> Avro with embedded schema as it has always been. If the
>> RecordWriter
>> >>> >>> is set, either the schema can be hardcoded, or they can use
>> "Inherit
>> >>> >>> Record Schema" even though there's no reader, and that would
>> mimic the
>> >>> >>> current behavior where the schema is inferred from the database
>> >>> >>> columns and used for the writer. There is precedent for this
>> pattern
>> >>> >>> in the SiteToSite reporting tasks.
>> >>> >>>
>> >>> >>> To Bryan's point about history, Avro at the time was the most
>> >>> >>> descriptive of the solutions because it maintains the schema and
>> >>> >>> datatypes with the data, unlike JSON, CSV, etc. Also before the
>> record
>> >>> >>> readers/writers, as Bryan said, you pretty much had to split,
>> >>> >>> transform, merge. We just need to make that processor (and others
>> with
>> >>> >>> specific input/output formats) "record-aware" for better
>> performance.
>> >>> >>>
>> >>> >>> Regards,
>> >>> >>> Matt
>> >>> >>>
>> >>> >>> [1] https://issues.apache.org/jira/browse/NIFI-4517
>> >>> >>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bb...@gmail.com>
>> wrote:
>> >>> >>> >
>> >>> >>> > I would also add that the pattern of splitting to 1 record per
>> flow
>> >>> >>> > file was common before the record processors existed, and
>> generally
>> >>> >>> > this can/should be avoided now in favor of
>> processing/manipulating
>> >>> >>> > records in place, and keeping them together in large batches.
>> >>> >>> >
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <
>> aperepel@gmail.com> wrote:
>> >>> >>> > > Careful, that makes too much sense, Joe ;)
>> >>> >>> > >
>> >>> >>> > >
>> >>> >>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <jo...@gmail.com>
>> wrote:
>> >>> >>> > >>
>> >>> >>> > >> i think we just need to make an ExecuteSqlRecord processor.
>> >>> >>> > >>
>> >>> >>> > >> thanks
>> >>> >>> > >>
>> >>> >>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <
>> mikerthomsen@gmail.com> wrote:
>> >>> >>> > >>>
>> >>> >>> > >>> My guess is that it is due to the fact that Avro is the
>> only record type
>> >>> >>> > >>> that can match SQL pretty closely, feature for feature, on
>> data types.
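For context, a rough sketch of that mapping, using general Avro logical-type
conventions (an assumption for illustration, not something spelled out in
this thread):

  SQL INTEGER   -> Avro int
  SQL VARCHAR   -> Avro string
  SQL DOUBLE    -> Avro double
  SQL DECIMAL   -> Avro bytes + "decimal" logical type (precision/scale preserved)
  SQL TIMESTAMP -> Avro long  + "timestamp-millis" logical type

JSON or CSV would collapse most of these to strings or untyped numbers,
losing the type information Avro keeps with the data.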
>> >>> >>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin
>> >>> >>> > >>> <boris@boristyukin.com> wrote:
>> >>> >>> > >>>>
>> >>> >>> > >>>> I've been wondering since I started learning NiFi why the
>> >>> >>> > >>>> ExecuteSQL processor only returns AVRO-formatted data. All
>> >>> >>> > >>>> the community examples I've seen then convert the AVRO to
>> >>> >>> > >>>> JSON, and pretty much all of them then split the JSON into
>> >>> >>> > >>>> multiple flowfiles.
>> >>> >>> > >>>>
>> >>> >>> > >>>> I found myself doing the same thing over and over and over
>> >>> >>> > >>>> again.
>> >>> >>> > >>>>
>> >>> >>> > >>>> Since everyone is doing it, is there a strong reason why
>> >>> >>> > >>>> AVRO is liked so much? And why does everyone continue with
>> >>> >>> > >>>> this three-step pattern rather than giving users an option
>> >>> >>> > >>>> to output JSON instead, and another option to output one
>> >>> >>> > >>>> flowfile or multiple (one per record)?
>> >>> >>> > >>>>
>> >>> >>> > >>>> thanks
>> >>> >>> > >>>> Boris
>> >>> >>
>> >>> >>
>> >>> >
>>
>

Re: AVRO is the only output format with ExecuteSQL

Posted by Matt Burgess <ma...@apache.org>.
Haha, thanks, but I can't take credit for that much throughput ;) I
moved 99% of ExecuteSQL out to a base class, since the main difference
was a line or two of code to do the actual write and update the
attributes; the two processors just contain the differences in logic
and properties between them.

Oftentimes with a PR you can just check out my branch, build the NAR,
and drop it into your own assembly, but because there were record
changes (ResultSetRecordSet), you're better off building from my
branch (which is based on master as of Friday) or cherry-picking my
commit onto your master (but be sure to reset --hard to
upstream/master before you pull new stuff down, or you'll get an
unwanted merge commit).
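For reference, a minimal sketch of both approaches from the command line,
assuming a local clone of apache/nifi with the Apache repo as a remote named
"upstream" (the pr-2945 branch name and the commit placeholder are
illustrations only):

  # Option 1: build from the PR branch (GitHub exposes PR #2945 at pull/2945/head)
  git fetch upstream pull/2945/head:pr-2945
  git checkout pr-2945
  mvn clean install -DskipTests

  # Option 2: cherry-pick the commit onto a clean local master
  git checkout master
  git reset --hard upstream/master   # avoids the unwanted merge commit Matt mentions
  git cherry-pick <commit-sha>       # the commit from the PR branch
  mvn clean install -DskipTests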

Please let me know how/if it works for you!

Thanks,
Matt


Re: AVRO is the only output format with ExecuteSQL

Posted by Boris Tyukin <bo...@boristyukin.com>.
great, thanks all!

nice tool, Otto!

On Mon, Aug 13, 2018 at 9:15 AM Otto Fowler <ot...@gmail.com> wrote:

> This script:
> https://github.com/ottobackwards/Metron-and-Nifi-Scripts/blob/master/nifi/checkout-nifi-pr
> will let you checkout any NIFI PR to a local directory and build it.
>
> Just:
> cd tmp
> checkout-nifi-pr 2945
>
> Maybe useful.
>
>
> On August 13, 2018 at 08:36:04, Boris Tyukin (boris@boristyukin.com)
> wrote:
>
> Matt, you are awesome! 15 files changes and 3k lines of code - man, do not
> tell me you did that in just a few days :)
>
> since it has not been merged yet with the master, can I just use your
> personal branch to compile entire nifi? or is it better to cherry pick your
> commit into master? I would like to try it out
>
> Boris
>
> On Fri, Aug 10, 2018 at 4:55 PM Matt Burgess <ma...@apache.org> wrote:
>
>> Boris et al,
>>
>> I put up a PR [1] to add ExecuteSQLRecord and QueryDatabaseTableRecord
>> under NIFI-4517, in case anyone wants to play around with it :)
>>
>> Regards,
>> Matt
>>
>> [1] https://github.com/apache/nifi/pull/2945
>> On Tue, Aug 7, 2018 at 8:30 PM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>> >
>> > Matt, you rock!! thank you!!
>> >
>> > On Tue, Aug 7, 2018 at 5:16 PM Matt Burgess <ma...@gmail.com>
>> wrote:
>> >>
>> >> Sounds good, it makes the underlying code a bit more complicated but I
>> see from y’all’s points that a “separate” processor is a better user
>> experience. I’m knee deep in it as we speak, hope to have a PR up in a few
>> days.
>> >>
>> >> Thanks,
>> >> Matt
>> >>
>> >>
>> >> On Aug 7, 2018, at 5:07 PM, Andrew Grande <ap...@gmail.com> wrote:
>> >>
>> >> I'd really like to see the Record suffix on the processor for
>> discoverability, as already mentioned.
>> >>
>> >> Andrew
>> >>
>> >> On Tue, Aug 7, 2018, 2:16 PM Matt Burgess <ma...@apache.org>
>> wrote:
>> >>>
>> >>> Yeah that's definitely doable, most of the logic for writing a
>> >>> ResultSet to a Flow File is localized (currently to JdbcCommon but
>> >>> also in ResultSetRecordSet), so I wouldn't think it would be too much
>> >>> refactor. What are folks thoughts on whether to add a Record Writer
>> >>> property to the existing ExecuteSQL or subclass it to a new processor
>> >>> called ExecuteSQLRecord? The former is more consistent with how the
>> >>> SiteToSite reporting tasks work, but this is a processor. The latter
>> >>> is more consistent with the way we've done other record processors,
>> >>> and the benefit there is that we don't have to add a bunch of
>> >>> documentation to fields that will be ignored (such as the Use Avro
>> >>> Logical Types property which we wouldn't need in a ExecuteSQLRecord).
>> >>> Having said that, we will want to offer the same options in the Avro
>> >>> Reader/Writer, but Peter is working on that under NIFI-5405 [1].
>> >>>
>> >>> Thanks,
>> >>> Matt
>> >>>
>> >>> [1] https://issues.apache.org/jira/browse/NIFI-5405
>> >>>
>> >>> On Tue, Aug 7, 2018 at 2:06 PM Andy LoPresto <al...@apache.org>
>> wrote:
>> >>> >
>> >>> > Matt,
>> >>> >
>> >>> > Would extending the core ExecuteSQL processor with an
>> ExecuteSQLRecord processor also work? I wonder about discoverability if
>> only one processor is present and in other places we explicitly name the
>> processors which handle records as such. If the ExecuteSQL processor
>> handled all the SQL logic, and the ExecuteSQLRecord processor just
>> delegated most of the processing in its #onTrigger() method to super, do
>> you foresee any substantial difficulties? It might require some refactoring
>> of the parent #onTrigger() to service methods.
>> >>> >
>> >>> >
>> >>> > Andy LoPresto
>> >>> > alopresto@apache.org
>> >>> > alopresto.apache@gmail.com
>> >>> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> >>> >
>> >>> > On Aug 7, 2018, at 10:25 AM, Andrew Grande <ap...@gmail.com>
>> wrote:
>> >>> >
>> >>> > As a side note, one has to ha e a serious justification _not_ to
>> use record-based processors. The benefits, including performance, are too
>> numerous to call out here.
>> >>> >
>> >>> > Andrew
>> >>> >
>> >>> > On Tue, Aug 7, 2018, 1:15 PM Mark Payne <ma...@hotmail.com>
>> wrote:
>> >>> >>
>> >>> >> Boris,
>> >>> >>
>> >>> >> Using a Record-based processor does not mean that you need to
>> define a schema upfront. This is
>> >>> >> necessary if the source itself cannot provide a schema. However,
>> since it is pulling structured data
>> >>> >> and the schema can be inferred from the database, you wouldn't
>> need to. As Matt was saying, your
>> >>> >> Record Writer can simply be configured to Inherit Record Schema.
>> It can then write the schema to
>> >>> >> the "avro.schema" attribute or you can choose "Do Not Write
>> Schema". This would still allow the data
>> >>> >> to be written in JSON, CSV, etc.
>> >>> >>
>> >>> >> You could also have the Record Writer choose to write the schema
>> using the "avro.schema" attribute,
>> >>> >> as mentioned above, and then have any down-stream processors read
>> the schema from this attribute.
>> >>> >> This would allow you to use any record-oriented processors you'd
>> like without having to define the
>> >>> >> schema yourself, if you don't want to.
>> >>> >>
>> >>> >> Thanks
>> >>> >> -Mark
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>> >>> >>
>> >>> >> thanks for all the responses! it means I am not the only one
>> interested in this topic.
>> >>> >>
>> >>> >> Record-aware version would be really nice, but a lot of times I do
>> not want to use record-based processors since I need to define a schema for
>> input/output upfront and just want to run SQL query and get whatever
>> results back. It just adds an extra step that will be subject to
>> break/support.
>> >>> >>
>> >>> >> Similar to Kafka processors, it is nice to have an option of
>> record-based processor vs. message oriented processor. But if one processor
>> can do it all, it is even better :)
>> >>> >>
>> >>> >>
>> >>> >> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <ma...@apache.org>
>> wrote:
>> >>> >>>
>> >>> >>> I'm definitely interested in supporting a record-aware version as
>> well
>> >>> >>> (I wrote the Jira up last year [1] but haven't gotten around to
>> >>> >>> implementing it), however I agree with Peter's comment on the
>> Jira.
>> >>> >>> Since ExecuteSQL is an oft-touched processor, if we had two
>> processors
>> >>> >>> that only differed in how the output is formatted, it could be
>> harder
>> >>> >>> to maintain (bugs to be fixed in two places, e.g.). I think we
>> should
>> >>> >>> add an optional RecordWriter property to ExecuteSQL, and the
>> >>> >>> documentation would reflect that if it is not set, the output
>> will be
>> >>> >>> Avro with embedded schema as it has always been. If the
>> RecordWriter
>> >>> >>> is set, either the schema can be hardcoded, or they can use
>> "Inherit
>> >>> >>> Record Schema" even though there's no reader, and that would
>> mimic the
>> >>> >>> current behavior where the schema is inferred from the database
>> >>> >>> columns and used for the writer. There is precedence for this
>> pattern
>> >>> >>> in the SiteToSite reporting tasks.
>> >>> >>>
>> >>> >>> To Bryan's point about history, Avro at the time was the most
>> >>> >>> descriptive of the solutions because it maintains the schema and
>> >>> >>> datatypes with the data, unlike JSON, CSV, etc. Also before the
>> record
>> >>> >>> readers/writers, as Bryan said, you pretty much had to split,
>> >>> >>> transform, merge. We just need to make that processor (and others
>> with
>> >>> >>> specific input/output formats) "record-aware" for better
>> performance.
>> >>> >>>
>> >>> >>> Regards,
>> >>> >>> Matt
>> >>> >>>
>> >>> >>> [1] https://issues.apache.org/jira/browse/NIFI-4517
>> >>> >>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bb...@gmail.com>
>> wrote:
>> >>> >>> >
>> >>> >>> > I would also add that the pattern of splitting to 1 record per
>> flow
>> >>> >>> > file was common before the record processors existed, and
>> generally
>> >>> >>> > this can/should be avoided now in favor of
>> processing/manipulating
>> >>> >>> > records in place, and keeping them together in large batches.
>> >>> >>> >
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <
>> aperepel@gmail.com> wrote:
>> >>> >>> > > Careful, that makes too much sense, Joe ;)
>> >>> >>> > >
>> >>> >>> > >
>> >>> >>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <jo...@gmail.com>
>> wrote:
>> >>> >>> > >>
>> >>> >>> > >> i think we just need to make an ExecuteSqlRecord processor.
>> >>> >>> > >>
>> >>> >>> > >> thanks
>> >>> >>> > >>
>> >>> >>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <
>> mikerthomsen@gmail.com> wrote:
>> >>> >>> > >>>
>> >>> >>> > >>> My guess is that it is due to the fact that Avro is the
>> only record type
>> >>> >>> > >>> that can match sql pretty closely feature to feature on
>> data types.
>> >>> >>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <
>> boris@boristyukin.com>
>> >>> >>> > >>> wrote:
>> >>> >>> > >>>>
>> >>> >>> > >>>> I've been wondering since I started learning NiFi why
>> ExecuteSQL
>> >>> >>> > >>>> processor only returns AVRO formatted data. All community
>> examples I've seen
>> >>> >>> > >>>> then convert AVRO to json and pretty much all of them then
>> split json to
>> >>> >>> > >>>> multiple flows.
>> >>> >>> > >>>>
>> >>> >>> > >>>> I found myself doing the same thing over and over and over
>> again.
>> >>> >>> > >>>>
>> >>> >>> > >>>> Since everyone is doing it, is there a strong reason why
>> AVRO is liked
>> >>> >>> > >>>> so much? And why everyone continues doing this 3 step
>> pattern rather than
>> >>> >>> > >>>> providing users with an option to output json instead and
>> another option to
>> >>> >>> > >>>> output one flowfile or multiple (one per record).
>> >>> >>> > >>>>
>> >>> >>> > >>>> thanks
>> >>> >>> > >>>> Boris
>> >>> >>
>> >>> >>
>> >>> >
>>
>

Re: AVRO is the only output format with ExecuteSQL

Posted by Otto Fowler <ot...@gmail.com>.
This script:
https://github.com/ottobackwards/Metron-and-Nifi-Scripts/blob/master/nifi/checkout-nifi-pr
will let you check out any NiFi PR to a local directory and build it.

Just:
cd tmp
checkout-nifi-pr 2945

Maybe useful.

Re: AVRO is the only output format with ExecuteSQL

Posted by Boris Tyukin <bo...@boristyukin.com>.
Matt, you are awesome! 15 files changed and 3k lines of code - man, do not
tell me you did that in just a few days :)

since it has not been merged into master yet, can I just use your
personal branch to compile all of NiFi? or is it better to cherry-pick your
commit into master? I would like to try it out

Boris


Re: AVRO is the only output format with ExecuteSQL

Posted by Matt Burgess <ma...@apache.org>.
Boris et al,

I put up a PR [1] to add ExecuteSQLRecord and QueryDatabaseTableRecord
under NIFI-4517, in case anyone wants to play around with it :)

Regards,
Matt

[1] https://github.com/apache/nifi/pull/2945

Re: AVRO is the only output format with ExecuteSQL

Posted by Boris Tyukin <bo...@boristyukin.com>.
Matt, you rock!! thank you!!

On Tue, Aug 7, 2018 at 5:16 PM Matt Burgess <ma...@gmail.com> wrote:

> Sounds good, it makes the underlying code a bit more complicated but I see
> from y’all’s points that a “separate” processor is a better user
> experience. I’m knee deep in it as we speak, hope to have a PR up in a few
> days.
>
> Thanks,
> Matt
>
>
> On Aug 7, 2018, at 5:07 PM, Andrew Grande <ap...@gmail.com> wrote:
>
> I'd really like to see the Record suffix on the processor for
> discoverability, as already mentioned.
>
> Andrew
>
> On Tue, Aug 7, 2018, 2:16 PM Matt Burgess <ma...@apache.org> wrote:
>
>> Yeah that's definitely doable, most of the logic for writing a
>> ResultSet to a Flow File is localized (currently to JdbcCommon but
>> also in ResultSetRecordSet), so I wouldn't think it would be too much of a
>> refactor. What are folks' thoughts on whether to add a Record Writer
>> property to the existing ExecuteSQL or subclass it to a new processor
>> called ExecuteSQLRecord? The former is more consistent with how the
>> SiteToSite reporting tasks work, but this is a processor. The latter
>> is more consistent with the way we've done other record processors,
>> and the benefit there is that we don't have to add a bunch of
>> documentation to fields that will be ignored (such as the Use Avro
>> Logical Types property which we wouldn't need in an ExecuteSQLRecord).
>> Having said that, we will want to offer the same options in the Avro
>> Reader/Writer, but Peter is working on that under NIFI-5405 [1].
>>
>> Thanks,
>> Matt
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-5405
>>
>> On Tue, Aug 7, 2018 at 2:06 PM Andy LoPresto <al...@apache.org>
>> wrote:
>> >
>> > Matt,
>> >
>> > Would extending the core ExecuteSQL processor with an ExecuteSQLRecord
>> processor also work? I wonder about discoverability if only one processor
>> is present and in other places we explicitly name the processors which
>> handle records as such. If the ExecuteSQL processor handled all the SQL
>> logic, and the ExecuteSQLRecord processor just delegated most of the
>> processing in its #onTrigger() method to super, do you foresee any
>> substantial difficulties? It might require some refactoring of the parent
>> #onTrigger() to service methods.
>> >
>> >
>> > Andy LoPresto
>> > alopresto@apache.org
>> > alopresto.apache@gmail.com
>> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> >
>> > On Aug 7, 2018, at 10:25 AM, Andrew Grande <ap...@gmail.com> wrote:
>> >
>> > As a side note, one has to have a serious justification _not_ to use
>> record-based processors. The benefits, including performance, are too
>> numerous to call out here.
>> >
>> > Andrew
>> >
>> > On Tue, Aug 7, 2018, 1:15 PM Mark Payne <ma...@hotmail.com> wrote:
>> >>
>> >> Boris,
>> >>
>> >> Using a Record-based processor does not mean that you need to define a
>> schema upfront. This is
>> >> necessary if the source itself cannot provide a schema. However, since
>> it is pulling structured data
>> >> and the schema can be inferred from the database, you wouldn't need
>> to. As Matt was saying, your
>> >> Record Writer can simply be configured to Inherit Record Schema. It
>> can then write the schema to
>> >> the "avro.schema" attribute or you can choose "Do Not Write Schema".
>> This would still allow the data
>> >> to be written in JSON, CSV, etc.
>> >>
>> >> You could also have the Record Writer choose to write the schema using
>> the "avro.schema" attribute,
>> >> as mentioned above, and then have any down-stream processors read the
>> schema from this attribute.
>> >> This would allow you to use any record-oriented processors you'd like
>> without having to define the
>> >> schema yourself, if you don't want to.
>> >>
>> >> Thanks
>> >> -Mark
>> >>
>> >>
>> >>
>> >> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>> >>
>> >> thanks for all the responses! it means I am not the only one
>> interested in this topic.
>> >>
>> >> Record-aware version would be really nice, but a lot of times I do not
>> want to use record-based processors since I need to define a schema for
>> input/output upfront and just want to run SQL query and get whatever
>> results back. It just adds an extra step that will be subject to
>> break/support.
>> >>
>> >> Similar to Kafka processors, it is nice to have an option of
>> record-based processor vs. message oriented processor. But if one processor
>> can do it all, it is even better :)
>> >>
>> >>
>> >> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <ma...@apache.org>
>> wrote:
>> >>>
>> >>> I'm definitely interested in supporting a record-aware version as well
>> >>> (I wrote the Jira up last year [1] but haven't gotten around to
>> >>> implementing it), however I agree with Peter's comment on the Jira.
>> >>> Since ExecuteSQL is an oft-touched processor, if we had two processors
>> >>> that only differed in how the output is formatted, it could be harder
>> >>> to maintain (bugs to be fixed in two places, e.g.). I think we should
>> >>> add an optional RecordWriter property to ExecuteSQL, and the
>> >>> documentation would reflect that if it is not set, the output will be
>> >>> Avro with embedded schema as it has always been. If the RecordWriter
>> >>> is set, either the schema can be hardcoded, or they can use "Inherit
>> >>> Record Schema" even though there's no reader, and that would mimic the
>> >>> current behavior where the schema is inferred from the database
>> >>> columns and used for the writer. There is precedent for this pattern
>> >>> in the SiteToSite reporting tasks.
>> >>>
>> >>> To Bryan's point about history, Avro at the time was the most
>> >>> descriptive of the solutions because it maintains the schema and
>> >>> datatypes with the data, unlike JSON, CSV, etc. Also before the record
>> >>> readers/writers, as Bryan said, you pretty much had to split,
>> >>> transform, merge. We just need to make that processor (and others with
>> >>> specific input/output formats) "record-aware" for better performance.
>> >>>
>> >>> Regards,
>> >>> Matt
>> >>>
>> >>> [1] https://issues.apache.org/jira/browse/NIFI-4517
>> >>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bb...@gmail.com> wrote:
>> >>> >
>> >>> > I would also add that the pattern of splitting to 1 record per flow
>> >>> > file was common before the record processors existed, and generally
>> >>> > this can/should be avoided now in favor of processing/manipulating
>> >>> > records in place, and keeping them together in large batches.
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <ap...@gmail.com>
>> wrote:
>> >>> > > Careful, that makes too much sense, Joe ;)
>> >>> > >
>> >>> > >
>> >>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <jo...@gmail.com> wrote:
>> >>> > >>
>> >>> > >> i think we just need to make an ExecuteSqlRecord processor.
>> >>> > >>
>> >>> > >> thanks
>> >>> > >>
>> >>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <
>> mikerthomsen@gmail.com> wrote:
>> >>> > >>>
>> >>> > >>> My guess is that it is due to the fact that Avro is the only
>> record type
>> >>> > >>> that can match sql pretty closely feature to feature on data
>> types.
>> >>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <
>> boris@boristyukin.com>
>> >>> > >>> wrote:
>> >>> > >>>>
>> >>> > >>>> I've been wondering since I started learning NiFi why
>> ExecuteSQL
>> >>> > >>>> processor only returns AVRO formatted data. All community
>> examples I've seen
>> >>> > >>>> then convert AVRO to json and pretty much all of them then
>> split json to
>> >>> > >>>> multiple flows.
>> >>> > >>>>
>> >>> > >>>> I found myself doing the same thing over and over and over
>> again.
>> >>> > >>>>
>> >>> > >>>> Since everyone is doing it, is there a strong reason why AVRO
>> is liked
>> >>> > >>>> so much? And why everyone continues doing this 3 step pattern
>> rather than
>> >>> > >>>> providing users with an option to output json instead and
>> another option to
>> >>> > >>>> output one flowfile or multiple (one per record).
>> >>> > >>>>
>> >>> > >>>> thanks
>> >>> > >>>> Boris
>> >>
>> >>
>> >
>>
>

Re: AVRO is the only output format with ExecuteSQL

Posted by Matt Burgess <ma...@gmail.com>.
Sounds good, it makes the underlying code a bit more complicated but I see from y’all’s points that a “separate” processor is a better user experience. I’m knee deep in it as we speak, hope to have a PR up in a few days.

Thanks,
Matt


> On Aug 7, 2018, at 5:07 PM, Andrew Grande <ap...@gmail.com> wrote:
> 
> I'd really like to see the Record suffix on the processor for discoverability, as already mentioned.
> 
> Andrew
> 
>> On Tue, Aug 7, 2018, 2:16 PM Matt Burgess <ma...@apache.org> wrote:
>> Yeah that's definitely doable, most of the logic for writing a
>> ResultSet to a Flow File is localized (currently to JdbcCommon but
>> also in ResultSetRecordSet), so I wouldn't think it would be too much of a
>> refactor. What are folks' thoughts on whether to add a Record Writer
>> property to the existing ExecuteSQL or subclass it to a new processor
>> called ExecuteSQLRecord? The former is more consistent with how the
>> SiteToSite reporting tasks work, but this is a processor. The latter
>> is more consistent with the way we've done other record processors,
>> and the benefit there is that we don't have to add a bunch of
>> documentation to fields that will be ignored (such as the Use Avro
>> Logical Types property which we wouldn't need in an ExecuteSQLRecord).
>> Having said that, we will want to offer the same options in the Avro
>> Reader/Writer, but Peter is working on that under NIFI-5405 [1].
>> 
>> Thanks,
>> Matt
>> 
>> [1] https://issues.apache.org/jira/browse/NIFI-5405
>> 
>> On Tue, Aug 7, 2018 at 2:06 PM Andy LoPresto <al...@apache.org> wrote:
>> >
>> > Matt,
>> >
>> > Would extending the core ExecuteSQL processor with an ExecuteSQLRecord processor also work? I wonder about discoverability if only one processor is present and in other places we explicitly name the processors which handle records as such. If the ExecuteSQL processor handled all the SQL logic, and the ExecuteSQLRecord processor just delegated most of the processing in its #onTrigger() method to super, do you foresee any substantial difficulties? It might require some refactoring of the parent #onTrigger() to service methods.
>> >
>> >
>> > Andy LoPresto
>> > alopresto@apache.org
>> > alopresto.apache@gmail.com
>> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> >
>> > On Aug 7, 2018, at 10:25 AM, Andrew Grande <ap...@gmail.com> wrote:
>> >
>> > As a side note, one has to have a serious justification _not_ to use record-based processors. The benefits, including performance, are too numerous to call out here.
>> >
>> > Andrew
>> >
>> > On Tue, Aug 7, 2018, 1:15 PM Mark Payne <ma...@hotmail.com> wrote:
>> >>
>> >> Boris,
>> >>
>> >> Using a Record-based processor does not mean that you need to define a schema upfront. This is
>> >> necessary if the source itself cannot provide a schema. However, since it is pulling structured data
>> >> and the schema can be inferred from the database, you wouldn't need to. As Matt was saying, your
>> >> Record Writer can simply be configured to Inherit Record Schema. It can then write the schema to
>> >> the "avro.schema" attribute or you can choose "Do Not Write Schema". This would still allow the data
>> >> to be written in JSON, CSV, etc.
>> >>
>> >> You could also have the Record Writer choose to write the schema using the "avro.schema" attribute,
>> >> as mentioned above, and then have any down-stream processors read the schema from this attribute.
>> >> This would allow you to use any record-oriented processors you'd like without having to define the
>> >> schema yourself, if you don't want to.
>> >>
>> >> Thanks
>> >> -Mark
>> >>
>> >>
>> >>
>> >> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <bo...@boristyukin.com> wrote:
>> >>
>> >> thanks for all the responses! it means I am not the only one interested in this topic.
>> >>
>> >> Record-aware version would be really nice, but a lot of times I do not want to use record-based processors since I need to define a schema for input/output upfront and just want to run SQL query and get whatever results back. It just adds an extra step that will be subject to break/support.
>> >>
>> >> Similar to Kafka processors, it is nice to have an option of record-based processor vs. message oriented processor. But if one processor can do it all, it is even better :)
>> >>
>> >>
>> >> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <ma...@apache.org> wrote:
>> >>>
>> >>> I'm definitely interested in supporting a record-aware version as well
>> >>> (I wrote the Jira up last year [1] but haven't gotten around to
>> >>> implementing it), however I agree with Peter's comment on the Jira.
>> >>> Since ExecuteSQL is an oft-touched processor, if we had two processors
>> >>> that only differed in how the output is formatted, it could be harder
>> >>> to maintain (bugs to be fixed in two places, e.g.). I think we should
>> >>> add an optional RecordWriter property to ExecuteSQL, and the
>> >>> documentation would reflect that if it is not set, the output will be
>> >>> Avro with embedded schema as it has always been. If the RecordWriter
>> >>> is set, either the schema can be hardcoded, or they can use "Inherit
>> >>> Record Schema" even though there's no reader, and that would mimic the
>> >>> current behavior where the schema is inferred from the database
>> >>> columns and used for the writer. There is precedent for this pattern
>> >>> in the SiteToSite reporting tasks.
>> >>>
>> >>> To Bryan's point about history, Avro at the time was the most
>> >>> descriptive of the solutions because it maintains the schema and
>> >>> datatypes with the data, unlike JSON, CSV, etc. Also before the record
>> >>> readers/writers, as Bryan said, you pretty much had to split,
>> >>> transform, merge. We just need to make that processor (and others with
>> >>> specific input/output formats) "record-aware" for better performance.
>> >>>
>> >>> Regards,
>> >>> Matt
>> >>>
>> >>> [1] https://issues.apache.org/jira/browse/NIFI-4517
>> >>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bb...@gmail.com> wrote:
>> >>> >
>> >>> > I would also add that the pattern of splitting to 1 record per flow
>> >>> > file was common before the record processors existed, and generally
>> >>> > this can/should be avoided now in favor of processing/manipulating
>> >>> > records in place, and keeping them together in large batches.
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <ap...@gmail.com> wrote:
>> >>> > > Careful, that makes too much sense, Joe ;)
>> >>> > >
>> >>> > >
>> >>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <jo...@gmail.com> wrote:
>> >>> > >>
>> >>> > >> i think we just need to make an ExecuteSqlRecord processor.
>> >>> > >>
>> >>> > >> thanks
>> >>> > >>
>> >>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mi...@gmail.com> wrote:
>> >>> > >>>
>> >>> > >>> My guess is that it is due to the fact that Avro is the only record type
>> >>> > >>> that can match sql pretty closely feature to feature on data types.
>> >>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com>
>> >>> > >>> wrote:
>> >>> > >>>>
>> >>> > >>>> I've been wondering since I started learning NiFi why ExecuteSQL
>> >>> > >>>> processor only returns AVRO formatted data. All community examples I've seen
>> >>> > >>>> then convert AVRO to json and pretty much all of them then split json to
>> >>> > >>>> multiple flows.
>> >>> > >>>>
>> >>> > >>>> I found myself doing the same thing over and over and over again.
>> >>> > >>>>
>> >>> > >>>> Since everyone is doing it, is there a strong reason why AVRO is liked
>> >>> > >>>> so much? And why everyone continues doing this 3 step pattern rather than
>> >>> > >>>> providing users with an option to output json instead and another option to
>> >>> > >>>> output one flowfile or multiple (one per record).
>> >>> > >>>>
>> >>> > >>>> thanks
>> >>> > >>>> Boris
>> >>
>> >>
>> >

Re: AVRO is the only output format with ExecuteSQL

Posted by Andrew Grande <ap...@gmail.com>.
I'd really like to see the Record suffix on the processor for
discoverability, as already mentioned.

Andrew

On Tue, Aug 7, 2018, 2:16 PM Matt Burgess <ma...@apache.org> wrote:

> Yeah that's definitely doable, most of the logic for writing a
> ResultSet to a Flow File is localized (currently to JdbcCommon but
> also in ResultSetRecordSet), so I wouldn't think it would be too much of a
> refactor. What are folks' thoughts on whether to add a Record Writer
> property to the existing ExecuteSQL or subclass it to a new processor
> called ExecuteSQLRecord? The former is more consistent with how the
> SiteToSite reporting tasks work, but this is a processor. The latter
> is more consistent with the way we've done other record processors,
> and the benefit there is that we don't have to add a bunch of
> documentation to fields that will be ignored (such as the Use Avro
> Logical Types property which we wouldn't need in an ExecuteSQLRecord).
> Having said that, we will want to offer the same options in the Avro
> Reader/Writer, but Peter is working on that under NIFI-5405 [1].
>
> Thanks,
> Matt
>
> [1] https://issues.apache.org/jira/browse/NIFI-5405
>
> On Tue, Aug 7, 2018 at 2:06 PM Andy LoPresto <al...@apache.org> wrote:
> >
> > Matt,
> >
> > Would extending the core ExecuteSQL processor with an ExecuteSQLRecord
> processor also work? I wonder about discoverability if only one processor
> is present and in other places we explicitly name the processors which
> handle records as such. If the ExecuteSQL processor handled all the SQL
> logic, and the ExecuteSQLRecord processor just delegated most of the
> processing in its #onTrigger() method to super, do you foresee any
> substantial difficulties? It might require some refactoring of the parent
> #onTrigger() to service methods.
> >
> >
> > Andy LoPresto
> > alopresto@apache.org
> > alopresto.apache@gmail.com
> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> >
> > On Aug 7, 2018, at 10:25 AM, Andrew Grande <ap...@gmail.com> wrote:
> >
> > As a side note, one has to have a serious justification _not_ to use
> record-based processors. The benefits, including performance, are too
> numerous to call out here.
> >
> > Andrew
> >
> > On Tue, Aug 7, 2018, 1:15 PM Mark Payne <ma...@hotmail.com> wrote:
> >>
> >> Boris,
> >>
> >> Using a Record-based processor does not mean that you need to define a
> schema upfront. This is
> >> necessary if the source itself cannot provide a schema. However, since
> it is pulling structured data
> >> and the schema can be inferred from the database, you wouldn't need to.
> As Matt was saying, your
> >> Record Writer can simply be configured to Inherit Record Schema. It can
> then write the schema to
> >> the "avro.schema" attribute or you can choose "Do Not Write Schema".
> This would still allow the data
> >> to be written in JSON, CSV, etc.
> >>
> >> You could also have the Record Writer choose to write the schema using
> the "avro.schema" attribute,
> >> as mentioned above, and then have any down-stream processors read the
> schema from this attribute.
> >> This would allow you to use any record-oriented processors you'd like
> without having to define the
> >> schema yourself, if you don't want to.
> >>
> >> Thanks
> >> -Mark
> >>
> >>
> >>
> >> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <bo...@boristyukin.com>
> wrote:
> >>
> >> thanks for all the responses! it means I am not the only one interested
> in this topic.
> >>
> >> Record-aware version would be really nice, but a lot of times I do not
> want to use record-based processors since I need to define a schema for
> input/output upfront and just want to run SQL query and get whatever
> results back. It just adds an extra step that will be subject to
> break/support.
> >>
> >> Similar to Kafka processors, it is nice to have an option of
> record-based processor vs. message oriented processor. But if one processor
> can do it all, it is even better :)
> >>
> >>
> >> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <ma...@apache.org>
> wrote:
> >>>
> >>> I'm definitely interested in supporting a record-aware version as well
> >>> (I wrote the Jira up last year [1] but haven't gotten around to
> >>> implementing it), however I agree with Peter's comment on the Jira.
> >>> Since ExecuteSQL is an oft-touched processor, if we had two processors
> >>> that only differed in how the output is formatted, it could be harder
> >>> to maintain (bugs to be fixed in two places, e.g.). I think we should
> >>> add an optional RecordWriter property to ExecuteSQL, and the
> >>> documentation would reflect that if it is not set, the output will be
> >>> Avro with embedded schema as it has always been. If the RecordWriter
> >>> is set, either the schema can be hardcoded, or they can use "Inherit
> >>> Record Schema" even though there's no reader, and that would mimic the
> >>> current behavior where the schema is inferred from the database
> >>> columns and used for the writer. There is precedent for this pattern
> >>> in the SiteToSite reporting tasks.
> >>>
> >>> To Bryan's point about history, Avro at the time was the most
> >>> descriptive of the solutions because it maintains the schema and
> >>> datatypes with the data, unlike JSON, CSV, etc. Also before the record
> >>> readers/writers, as Bryan said, you pretty much had to split,
> >>> transform, merge. We just need to make that processor (and others with
> >>> specific input/output formats) "record-aware" for better performance.
> >>>
> >>> Regards,
> >>> Matt
> >>>
> >>> [1] https://issues.apache.org/jira/browse/NIFI-4517
> >>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bb...@gmail.com> wrote:
> >>> >
> >>> > I would also add that the pattern of splitting to 1 record per flow
> >>> > file was common before the record processors existed, and generally
> >>> > this can/should be avoided now in favor of processing/manipulating
> >>> > records in place, and keeping them together in large batches.
> >>> >
> >>> >
> >>> >
> >>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <ap...@gmail.com>
> wrote:
> >>> > > Careful, that makes too much sense, Joe ;)
> >>> > >
> >>> > >
> >>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <jo...@gmail.com> wrote:
> >>> > >>
> >>> > >> i think we just need to make an ExecuteSqlRecord processor.
> >>> > >>
> >>> > >> thanks
> >>> > >>
> >>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mi...@gmail.com>
> wrote:
> >>> > >>>
> >>> > >>> My guess is that it is due to the fact that Avro is the only
> record type
> >>> > >>> that can match sql pretty closely feature to feature on data
> types.
> >>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <
> boris@boristyukin.com>
> >>> > >>> wrote:
> >>> > >>>>
> >>> > >>>> I've been wondering since I started learning NiFi why ExecuteSQL
> >>> > >>>> processor only returns AVRO formatted data. All community
> examples I've seen
> >>> > >>>> then convert AVRO to json and pretty much all of them then
> split json to
> >>> > >>>> multiple flows.
> >>> > >>>>
> >>> > >>>> I found myself doing the same thing over and over and over
> again.
> >>> > >>>>
> >>> > >>>> Since everyone is doing it, is there a strong reason why AVRO
> is liked
> >>> > >>>> so much? And why everyone continues doing this 3 step pattern
> rather than
> >>> > >>>> providing users with an option to output json instead and
> another option to
> >>> > >>>> output one flowfile or multiple (one per record).
> >>> > >>>>
> >>> > >>>> thanks
> >>> > >>>> Boris
> >>
> >>
> >
>

Re: AVRO is the only output format with ExecuteSQL

Posted by Matt Burgess <ma...@apache.org>.
Yeah that's definitely doable, most of the logic for writing a
ResultSet to a Flow File is localized (currently to JdbcCommon but
also in ResultSetRecordSet), so I wouldn't think it would be too much of a
refactor. What are folks' thoughts on whether to add a Record Writer
property to the existing ExecuteSQL or subclass it to a new processor
called ExecuteSQLRecord? The former is more consistent with how the
SiteToSite reporting tasks work, but this is a processor. The latter
is more consistent with the way we've done other record processors,
and the benefit there is that we don't have to add a bunch of
documentation to fields that will be ignored (such as the Use Avro
Logical Types property which we wouldn't need in an ExecuteSQLRecord).
Having said that, we will want to offer the same options in the Avro
Reader/Writer, but Peter is working on that under NIFI-5405 [1].
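
For illustration, a minimal sketch of what that optional property could
look like (the property name, description, and wrapper class are
assumptions, not the final API):

import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.serialization.RecordSetWriterFactory;

class RecordWriterPropertySketch {
    // Left unset, the processor keeps its embedded-schema Avro output;
    // set, the configured writer controls the output format.
    static final PropertyDescriptor RECORD_WRITER = new PropertyDescriptor.Builder()
            .name("record-writer")
            .displayName("Record Writer")
            .description("Optional. If set, query results are written with "
                    + "the configured Record Writer instead of Avro.")
            .identifiesControllerService(RecordSetWriterFactory.class)
            .required(false)
            .build();
}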

Thanks,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-5405

On Tue, Aug 7, 2018 at 2:06 PM Andy LoPresto <al...@apache.org> wrote:
>
> Matt,
>
> Would extending the core ExecuteSQL processor with an ExecuteSQLRecord processor also work? I wonder about discoverability if only one processor is present and in other places we explicitly name the processors which handle records as such. If the ExecuteSQL processor handled all the SQL logic, and the ExecuteSQLRecord processor just delegated most of the processing in its #onTrigger() method to super, do you foresee any substantial difficulties? It might require some refactoring of the parent #onTrigger() to service methods.
>
>
> Andy LoPresto
> alopresto@apache.org
> alopresto.apache@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Aug 7, 2018, at 10:25 AM, Andrew Grande <ap...@gmail.com> wrote:
>
> As a side note, one has to have a serious justification _not_ to use record-based processors. The benefits, including performance, are too numerous to call out here.
>
> Andrew
>
> On Tue, Aug 7, 2018, 1:15 PM Mark Payne <ma...@hotmail.com> wrote:
>>
>> Boris,
>>
>> Using a Record-based processor does not mean that you need to define a schema upfront. This is
>> necessary if the source itself cannot provide a schema. However, since it is pulling structured data
>> and the schema can be inferred from the database, you wouldn't need to. As Matt was saying, your
>> Record Writer can simply be configured to Inherit Record Schema. It can then write the schema to
>> the "avro.schema" attribute or you can choose "Do Not Write Schema". This would still allow the data
>> to be written in JSON, CSV, etc.
>>
>> You could also have the Record Writer choose to write the schema using the "avro.schema" attribute,
>> as mentioned above, and then have any down-stream processors read the schema from this attribute.
>> This would allow you to use any record-oriented processors you'd like without having to define the
>> schema yourself, if you don't want to.
>>
>> Thanks
>> -Mark
>>
>>
>>
>> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <bo...@boristyukin.com> wrote:
>>
>> thanks for all the responses! it means I am not the only one interested in this topic.
>>
>> Record-aware version would be really nice, but a lot of times I do not want to use record-based processors since I need to define a schema for input/output upfront and just want to run SQL query and get whatever results back. It just adds an extra step that will be subject to break/support.
>>
>> Similar to Kafka processors, it is nice to have an option of record-based processor vs. message oriented processor. But if one processor can do it all, it is even better :)
>>
>>
>> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <ma...@apache.org> wrote:
>>>
>>> I'm definitely interested in supporting a record-aware version as well
>>> (I wrote the Jira up last year [1] but haven't gotten around to
>>> implementing it), however I agree with Peter's comment on the Jira.
>>> Since ExecuteSQL is an oft-touched processor, if we had two processors
>>> that only differed in how the output is formatted, it could be harder
>>> to maintain (bugs to be fixed in two places, e.g.). I think we should
>>> add an optional RecordWriter property to ExecuteSQL, and the
>>> documentation would reflect that if it is not set, the output will be
>>> Avro with embedded schema as it has always been. If the RecordWriter
>>> is set, either the schema can be hardcoded, or they can use "Inherit
>>> Record Schema" even though there's no reader, and that would mimic the
>>> current behavior where the schema is inferred from the database
>>> columns and used for the writer. There is precedent for this pattern
>>> in the SiteToSite reporting tasks.
>>>
>>> To Bryan's point about history, Avro at the time was the most
>>> descriptive of the solutions because it maintains the schema and
>>> datatypes with the data, unlike JSON, CSV, etc. Also before the record
>>> readers/writers, as Bryan said, you pretty much had to split,
>>> transform, merge. We just need to make that processor (and others with
>>> specific input/output formats) "record-aware" for better performance.
>>>
>>> Regards,
>>> Matt
>>>
>>> [1] https://issues.apache.org/jira/browse/NIFI-4517
>>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bb...@gmail.com> wrote:
>>> >
>>> > I would also add that the pattern of splitting to 1 record per flow
>>> > file was common before the record processors existed, and generally
>>> > this can/should be avoided now in favor of processing/manipulating
>>> > records in place, and keeping them together in large batches.
>>> >
>>> >
>>> >
>>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <ap...@gmail.com> wrote:
>>> > > Careful, that makes too much sense, Joe ;)
>>> > >
>>> > >
>>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <jo...@gmail.com> wrote:
>>> > >>
>>> > >> i think we just need to make an ExecuteSqlRecord processor.
>>> > >>
>>> > >> thanks
>>> > >>
>>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mi...@gmail.com> wrote:
>>> > >>>
>>> > >>> My guess is that it is due to the fact that Avro is the only record type
>>> > >>> that can match sql pretty closely feature to feature on data types.
>>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com>
>>> > >>> wrote:
>>> > >>>>
>>> > >>>> I've been wondering since I started learning NiFi why ExecuteSQL
>>> > >>>> processor only returns AVRO formatted data. All community examples I've seen
>>> > >>>> then convert AVRO to json and pretty much all of them then split json to
>>> > >>>> multiple flows.
>>> > >>>>
>>> > >>>> I found myself doing the same thing over and over and over again.
>>> > >>>>
>>> > >>>> Since everyone is doing it, is there a strong reason why AVRO is liked
>>> > >>>> so much? And why everyone continues doing this 3 step pattern rather than
>>> > >>>> providing users with an option to output json instead and another option to
>>> > >>>> output one flowfile or multiple (one per record).
>>> > >>>>
>>> > >>>> thanks
>>> > >>>> Boris
>>
>>
>

Re: AVRO is the only output format with ExecuteSQL

Posted by Andy LoPresto <al...@apache.org>.
Matt,

Would extending the core ExecuteSQL processor with an ExecuteSQLRecord processor also work? I wonder about discoverability if only one processor is present and in other places we explicitly name the processors which handle records as such. If the ExecuteSQL processor handled all the SQL logic, and the ExecuteSQLRecord processor just delegated most of the processing in its #onTrigger() method to super, do you foresee any substantial difficulties? It might require some refactoring of the parent #onTrigger() to service methods.
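
A self-contained sketch of that shape, with every name invented for
illustration (this is not NiFi's actual class hierarchy): the parent
owns the SQL handling and exposes a single overridable serialization
step, so the Record variant changes only how results are written.

class ExecuteSqlSketch {
    final void onTrigger(String query) {
        String resultSet = execute(query);  // shared SQL handling
        serialize(resultSet);               // the only step that varies
    }

    String execute(String query) {
        return "rows for: " + query;        // stand-in for real JDBC work
    }

    void serialize(String resultSet) {
        System.out.println("Avro with embedded schema: " + resultSet);
    }
}

class ExecuteSqlRecordSketch extends ExecuteSqlSketch {
    @Override
    void serialize(String resultSet) {
        System.out.println("configured Record Writer: " + resultSet);
    }
}

// Usage: new ExecuteSqlRecordSketch().onTrigger("SELECT * FROM t");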


Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Aug 7, 2018, at 10:25 AM, Andrew Grande <ap...@gmail.com> wrote:
> 
> As a side note, one has to have a serious justification _not_ to use record-based processors. The benefits, including performance, are too numerous to call out here.
> 
> Andrew
> 
> On Tue, Aug 7, 2018, 1:15 PM Mark Payne <ma...@hotmail.com> wrote:
> Boris,
> 
> Using a Record-based processor does not mean that you need to define a schema upfront. This is
> necessary if the source itself cannot provide a schema. However, since it is pulling structured data
> and the schema can be inferred from the database, you wouldn't need to. As Matt was saying, your
> Record Writer can simply be configured to Inherit Record Schema. It can then write the schema to
> the "avro.schema" attribute or you can choose "Do Not Write Schema". This would still allow the data
> to be written in JSON, CSV, etc.
> 
> You could also have the Record Writer choose to write the schema using the "avro.schema" attribute,
> as mentioned above, and then have any down-stream processors read the schema from this attribute.
> This would allow you to use any record-oriented processors you'd like without having to define the
> schema yourself, if you don't want to.
> 
> Thanks
> -Mark
> 
> 
> 
>> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <bo...@boristyukin.com> wrote:
>> 
>> thanks for all the responses! it means I am not the only one interested in this topic.
>> 
>> Record-aware version would be really nice, but a lot of times I do not want to use record-based processors since I need to define a schema for input/output upfront and just want to run SQL query and get whatever results back. It just adds an extra step that will be subject to break/support.
>> 
>> Similar to Kafka processors, it is nice to have an option of record-based processor vs. message oriented processor. But if one processor can do it all, it is even better :)
>> 
>> 
>> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <ma...@apache.org> wrote:
>> I'm definitely interested in supporting a record-aware version as well
>> (I wrote the Jira up last year [1] but haven't gotten around to
>> implementing it), however I agree with Peter's comment on the Jira.
>> Since ExecuteSQL is an oft-touched processor, if we had two processors
>> that only differed in how the output is formatted, it could be harder
>> to maintain (bugs to be fixed in two places, e.g.). I think we should
>> add an optional RecordWriter property to ExecuteSQL, and the
>> documentation would reflect that if it is not set, the output will be
>> Avro with embedded schema as it has always been. If the RecordWriter
>> is set, either the schema can be hardcoded, or they can use "Inherit
>> Record Schema" even though there's no reader, and that would mimic the
>> current behavior where the schema is inferred from the database
>> columns and used for the writer. There is precedent for this pattern
>> in the SiteToSite reporting tasks.
>> 
>> To Bryan's point about history, Avro at the time was the most
>> descriptive of the solutions because it maintains the schema and
>> datatypes with the data, unlike JSON, CSV, etc. Also before the record
>> readers/writers, as Bryan said, you pretty much had to split,
>> transform, merge. We just need to make that processor (and others with
>> specific input/output formats) "record-aware" for better performance.
>> 
>> Regards,
>> Matt
>> 
>> [1] https://issues.apache.org/jira/browse/NIFI-4517
>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bb...@gmail.com> wrote:
>> >
>> > I would also add that the pattern of splitting to 1 record per flow
>> > file was common before the record processors existed, and generally
>> > this can/should be avoided now in favor of processing/manipulating
>> > records in place, and keeping them together in large batches.
>> >
>> >
>> >
>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <ap...@gmail.com> wrote:
>> > > Careful, that makes too much sense, Joe ;)
>> > >
>> > >
>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <jo...@gmail.com> wrote:
>> > >>
>> > >> i think we just need to make an ExecuteSqlRecord processor.
>> > >>
>> > >> thanks
>> > >>
>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mi...@gmail.com> wrote:
>> > >>>
>> > >>> My guess is that it is due to the fact that Avro is the only record type
>> > >>> that can match sql pretty closely feature to feature on data types.
>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com>
>> > >>> wrote:
>> > >>>>
>> > >>>> I've been wondering since I started learning NiFi why ExecuteSQL
>> > >>>> processor only returns AVRO formatted data. All community examples I've seen
>> > >>>> then convert AVRO to json and pretty much all of them then split json to
>> > >>>> multiple flows.
>> > >>>>
>> > >>>> I found myself doing the same thing over and over and over again.
>> > >>>>
>> > >>>> Since everyone is doing it, is there a strong reason why AVRO is liked
>> > >>>> so much? And why everyone continues doing this 3 step pattern rather than
>> > >>>> providing users with an option to output json instead and another option to
>> > >>>> output one flowfile or multiple (one per record).
>> > >>>>
>> > >>>> thanks
>> > >>>> Boris
> 


Re: AVRO is the only output format with ExecuteSQL

Posted by Andrew Grande <ap...@gmail.com>.
As a side note, one has to have a serious justification _not_ to use
record-based processors. The benefits, including performance, are too
numerous to call out here.

Andrew

On Tue, Aug 7, 2018, 1:15 PM Mark Payne <ma...@hotmail.com> wrote:

> Boris,
>
> Using a Record-based processor does not mean that you need to define a
> schema upfront. This is
> necessary if the source itself cannot provide a schema. However, since it
> is pulling structured data
> and the schema can be inferred from the database, you wouldn't need to. As
> Matt was saying, your
> Record Writer can simply be configured to Inherit Record Schema. It can
> then write the schema to
> the "avro.schema" attribute or you can choose "Do Not Write Schema". This
> would still allow the data
> to be written in JSON, CSV, etc.
>
> You could also have the Record Writer choose to write the schema using the
> "avro.schema" attribute,
> as mentioned above, and then have any down-stream processors read the
> schema from this attribute.
> This would allow you to use any record-oriented processors you'd like
> without having to define the
> schema yourself, if you don't want to.
>
> Thanks
> -Mark
>
>
>
> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <bo...@boristyukin.com> wrote:
>
> thanks for all the responses! it means I am not the only one interested in
> this topic.
>
> Record-aware version would be really nice, but a lot of times I do not
> want to use record-based processors since I need to define a schema for
> input/output upfront and just want to run SQL query and get whatever
> results back. It just adds an extra step that will be subject to
> break/support.
>
> Similar to Kafka processors, it is nice to have an option of record-based
> processor vs. message oriented processor. But if one processor can do it
> all, it is even better :)
>
>
> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <ma...@apache.org> wrote:
>
>> I'm definitely interested in supporting a record-aware version as well
>> (I wrote the Jira up last year [1] but haven't gotten around to
>> implementing it), however I agree with Peter's comment on the Jira.
>> Since ExecuteSQL is an oft-touched processor, if we had two processors
>> that only differed in how the output is formatted, it could be harder
>> to maintain (bugs to be fixed in two places, e.g.). I think we should
>> add an optional RecordWriter property to ExecuteSQL, and the
>> documentation would reflect that if it is not set, the output will be
>> Avro with embedded schema as it has always been. If the RecordWriter
>> is set, either the schema can be hardcoded, or they can use "Inherit
>> Record Schema" even though there's no reader, and that would mimic the
>> current behavior where the schema is inferred from the database
>> columns and used for the writer. There is precedent for this pattern
>> in the SiteToSite reporting tasks.
>>
>> To Bryan's point about history, Avro at the time was the most
>> descriptive of the solutions because it maintains the schema and
>> datatypes with the data, unlike JSON, CSV, etc. Also before the record
>> readers/writers, as Bryan said, you pretty much had to split,
>> transform, merge. We just need to make that processor (and others with
>> specific input/output formats) "record-aware" for better performance.
>>
>> Regards,
>> Matt
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-4517
>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bb...@gmail.com> wrote:
>> >
>> > I would also add that the pattern of splitting to 1 record per flow
>> > file was common before the record processors existed, and generally
>> > this can/should be avoided now in favor of processing/manipulating
>> > records in place, and keeping them together in large batches.
>> >
>> >
>> >
>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <ap...@gmail.com>
>> wrote:
>> > > Careful, that makes too much sense, Joe ;)
>> > >
>> > >
>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <jo...@gmail.com> wrote:
>> > >>
>> > >> i think we just need to make an ExecuteSqlRecord processor.
>> > >>
>> > >> thanks
>> > >>
>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mi...@gmail.com>
>> wrote:
>> > >>>
>> > >>> My guess is that it is due to the fact that Avro is the only record
>> type
>> > >>> that can match sql pretty closely feature to feature on data types.
>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com>
>> > >>> wrote:
>> > >>>>
>> > >>>> I've been wondering since I started learning NiFi why ExecuteSQL
>> > >>>> processor only returns AVRO formatted data. All community examples
>> I've seen
>> > >>>> then convert AVRO to json and pretty much all of them then split
>> json to
>> > >>>> multiple flows.
>> > >>>>
>> > >>>> I found myself doing the same thing over and over and over again.
>> > >>>>
>> > >>>> Since everyone is doing it, is there a strong reason why AVRO is
>> liked
>> > >>>> so much? And why everyone continues doing this 3 step pattern
>> rather than
>> > >>>> providing users with an option to output json instead and another
>> option to
>> > >>>> output one flowfile or multiple (one per record).
>> > >>>>
>> > >>>> thanks
>> > >>>> Boris
>>
>
>

Re: AVRO is the only output format with ExecuteSQL

Posted by Boris Tyukin <bo...@boristyukin.com>.
now this is really slick! thanks Mark for educating me!

On Tue, Aug 7, 2018 at 1:15 PM Mark Payne <ma...@hotmail.com> wrote:

> Boris,
>
> Using a Record-based processor does not mean that you need to define a
> schema upfront. This is
> necessary if the source itself cannot provide a schema. However, since it
> is pulling structured data
> and the schema can be inferred from the database, you wouldn't need to. As
> Matt was saying, your
> Record Writer can simply be configured to Inherit Record Schema. It can
> then write the schema to
> the "avro.schema" attribute or you can choose "Do Not Write Schema". This
> would still allow the data
> to be written in JSON, CSV, etc.
>
> You could also have the Record Writer choose to write the schema using the
> "avro.schema" attribute,
> as mentioned above, and then have any down-stream processors read the
> schema from this attribute.
> This would allow you to use any record-oriented processors you'd like
> without having to define the
> schema yourself, if you don't want to.
>
> Thanks
> -Mark
>
>
>
> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <bo...@boristyukin.com> wrote:
>
> thanks for all the responses! it means I am not the only one interested in
> this topic.
>
> Record-aware version would be really nice, but a lot of times I do not
> want to use record-based processors since I need to define a schema for
> input/output upfront and just want to run SQL query and get whatever
> results back. It just adds an extra step that will be subject to
> break/support.
>
> Similar to Kafka processors, it is nice to have an option of record-based
> processor vs. message oriented processor. But if one processor can do it
> all, it is even better :)
>
>
> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <ma...@apache.org> wrote:
>
>> I'm definitely interested in supporting a record-aware version as well
>> (I wrote the Jira up last year [1] but haven't gotten around to
>> implementing it), however I agree with Peter's comment on the Jira.
>> Since ExecuteSQL is an oft-touched processor, if we had two processors
>> that only differed in how the output is formatted, it could be harder
>> to maintain (bugs to be fixed in two places, e.g.). I think we should
>> add an optional RecordWriter property to ExecuteSQL, and the
>> documentation would reflect that if it is not set, the output will be
>> Avro with embedded schema as it has always been. If the RecordWriter
>> is set, either the schema can be hardcoded, or they can use "Inherit
>> Record Schema" even though there's no reader, and that would mimic the
>> current behavior where the schema is inferred from the database
>> columns and used for the writer. There is precedent for this pattern
>> in the SiteToSite reporting tasks.
>>
>> To Bryan's point about history, Avro at the time was the most
>> descriptive of the solutions because it maintains the schema and
>> datatypes with the data, unlike JSON, CSV, etc. Also before the record
>> readers/writers, as Bryan said, you pretty much had to split,
>> transform, merge. We just need to make that processor (and others with
>> specific input/output formats) "record-aware" for better performance.
>>
>> Regards,
>> Matt
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-4517
>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bb...@gmail.com> wrote:
>> >
>> > I would also add that the pattern of splitting to 1 record per flow
>> > file was common before the record processors existed, and generally
>> > this can/should be avoided now in favor of processing/manipulating
>> > records in place, and keeping them together in large batches.
>> >
>> >
>> >
>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <ap...@gmail.com>
>> wrote:
>> > > Careful, that makes too much sense, Joe ;)
>> > >
>> > >
>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <jo...@gmail.com> wrote:
>> > >>
>> > >> i think we just need to make an ExecuteSqlRecord processor.
>> > >>
>> > >> thanks
>> > >>
>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mi...@gmail.com>
>> wrote:
>> > >>>
>> > >>> My guess is that it is due to the fact that Avro is the only record
>> type
>> > >>> that can match sql pretty closely feature to feature on data types.
>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com>
>> > >>> wrote:
>> > >>>>
>> > >>>> I've been wondering since I started learning NiFi why ExecuteSQL
>> > >>>> processor only returns AVRO formatted data. All community examples
>> I've seen
>> > >>>> then convert AVRO to json and pretty much all of them then split
>> json to
>> > >>>> multiple flows.
>> > >>>>
>> > >>>> I found myself doing the same thing over and over and over again.
>> > >>>>
>> > >>>> Since everyone is doing it, is there a strong reason why AVRO is
>> liked
>> > >>>> so much? And why everyone continues doing this 3 step pattern
>> rather than
>> > >>>> providing users with an option to output json instead and another
>> option to
>> > >>>> output one flowfile or multiple (one per record).
>> > >>>>
>> > >>>> thanks
>> > >>>> Boris
>>
>
>

Re: AVRO is the only output format with ExecuteSQL

Posted by Mark Payne <ma...@hotmail.com>.
Boris,

Using a Record-based processor does not mean that you need to define a schema upfront. This is
necessary if the source itself cannot provide a schema. However, since it is pulling structured data
and the schema can be inferred from the database, you wouldn't need to. As Matt was saying, your
Record Writer can simply be configured to Inherit Record Schema. It can then write the schema to
the "avro.schema" attribute or you can choose "Do Not Write Schema". This would still allow the data
to be written in JSON, CSV, etc.

You could also have the Record Writer choose to write the schema using the "avro.schema" attribute,
as mentioned above, and then have any down-stream processors read the schema from this attribute.
This would allow you to use any record-oriented processors you'd like without having to define the
schema yourself, if you don't want to.
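
For illustration, a minimal sketch of that round trip, assuming the
"avro.schema" attribute name from this thread (the class and method here
are invented, not NiFi's actual writer code):

import org.apache.avro.Schema;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.ProcessSession;

class SchemaAttributeSketch {
    // The inferred schema rides along on the flow file itself, so no
    // user-defined schema or registry lookup is needed downstream.
    static FlowFile attachSchema(ProcessSession session, FlowFile flowFile,
                                 Schema inferred) {
        return session.putAttribute(flowFile, "avro.schema",
                inferred.toString());
    }
}

A downstream reader would then point its schema access at that attribute
(for example, a Schema Text property of ${avro.schema}) and resolve the
same schema without anyone writing it by hand.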

Thanks
-Mark



On Aug 7, 2018, at 12:37 PM, Boris Tyukin <bo...@boristyukin.com> wrote:

thanks for all the responses! it means I am not the only one interested in this topic.

Record-aware version would be really nice, but a lot of times I do not want to use record-based processors since I need to define a schema for input/output upfront and just want to run SQL query and get whatever results back. It just adds an extra step that will be subject to break/support.

Similar to Kafka processors, it is nice to have an option of record-based processor vs. message oriented processor. But if one processor can do it all, it is even better :)


On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <ma...@apache.org> wrote:
I'm definitely interested in supporting a record-aware version as well
(I wrote the Jira up last year [1] but haven't gotten around to
implementing it), however I agree with Peter's comment on the Jira.
Since ExecuteSQL is an oft-touched processor, if we had two processors
that only differed in how the output is formatted, it could be harder
to maintain (bugs to be fixed in two places, e.g.). I think we should
add an optional RecordWriter property to ExecuteSQL, and the
documentation would reflect that if it is not set, the output will be
Avro with embedded schema as it has always been. If the RecordWriter
is set, either the schema can be hardcoded, or they can use "Inherit
Record Schema" even though there's no reader, and that would mimic the
current behavior where the schema is inferred from the database
columns and used for the writer. There is precedent for this pattern
in the SiteToSite reporting tasks.

To Bryan's point about history, Avro at the time was the most
descriptive of the solutions because it maintains the schema and
datatypes with the data, unlike JSON, CSV, etc. Also before the record
readers/writers, as Bryan said, you pretty much had to split,
transform, merge. We just need to make that processor (and others with
specific input/output formats) "record-aware" for better performance.

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-4517
On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bb...@gmail.com> wrote:
>
> I would also add that the pattern of splitting to 1 record per flow
> file was common before the record processors existed, and generally
> this can/should be avoided now in favor of processing/manipulating
> records in place, and keeping them together in large batches.
>
>
>
> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <ap...@gmail.com> wrote:
> > Careful, that makes too much sense, Joe ;)
> >
> >
> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <jo...@gmail.com> wrote:
> >>
> >> i think we just need to make an ExecuteSqlRecord processor.
> >>
> >> thanks
> >>
> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mi...@gmail.com> wrote:
> >>>
> >>> My guess is that it is due to the fact that Avro is the only record type
> >>> that can match sql pretty closely feature to feature on data types.
> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com>
> >>> wrote:
> >>>>
> >>>> I've been wondering since I started learning NiFi why ExecuteSQL
> >>>> processor only returns AVRO formatted data. All community examples I've seen
> >>>> then convert AVRO to json and pretty much all of them then split json to
> >>>> multiple flows.
> >>>>
> >>>> I found myself doing the same thing over and over and over again.
> >>>>
> >>>> Since everyone is doing it, is there a strong reason why AVRO is liked
> >>>> so much? And why everyone continues doing this 3 step pattern rather than
> >>>> providing users with an option to output json instead and another option to
> >>>> output one flowfile or multiple (one per record).
> >>>>
> >>>> thanks
> >>>> Boris


Re: AVRO is the only output format with ExecuteSQL

Posted by Boris Tyukin <bo...@boristyukin.com>.
thanks for all the responses! it means I am not the only one interested in
this topic.

Record-aware version would be really nice, but a lot of times I do not want
to use record-based processors since I need to define a schema for
input/output upfront and just want to run SQL query and get whatever
results back. It just adds an extra step that will be subject to
break/support.

Similar to Kafka processors, it is nice to have an option of record-based
processor vs. message oriented processor. But if one processor can do it
all, it is even better :)


On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <ma...@apache.org> wrote:

> I'm definitely interested in supporting a record-aware version as well
> (I wrote the Jira up last year [1] but haven't gotten around to
> implementing it), however I agree with Peter's comment on the Jira.
> Since ExecuteSQL is an oft-touched processor, if we had two processors
> that only differed in how the output is formatted, it could be harder
> to maintain (bugs to be fixed in two places, e.g.). I think we should
> add an optional RecordWriter property to ExecuteSQL, and the
> documentation would reflect that if it is not set, the output will be
> Avro with embedded schema as it has always been. If the RecordWriter
> is set, either the schema can be hardcoded, or they can use "Inherit
> Record Schema" even though there's no reader, and that would mimic the
> current behavior where the schema is inferred from the database
> columns and used for the writer. There is precedent for this pattern
> in the SiteToSite reporting tasks.
>
> To Bryan's point about history, Avro at the time was the most
> descriptive of the solutions because it maintains the schema and
> datatypes with the data, unlike JSON, CSV, etc. Also before the record
> readers/writers, as Bryan said, you pretty much had to split,
> transform, merge. We just need to make that processor (and others with
> specific input/output formats) "record-aware" for better performance.
>
> Regards,
> Matt
>
> [1] https://issues.apache.org/jira/browse/NIFI-4517
> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bb...@gmail.com> wrote:
> >
> > I would also add that the pattern of splitting to 1 record per flow
> > file was common before the record processors existed, and generally
> > this can/should be avoided now in favor of processing/manipulating
> > records in place, and keeping them together in large batches.
> >
> >
> >
> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <ap...@gmail.com>
> wrote:
> > > Careful, that makes too much sense, Joe ;)
> > >
> > >
> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <jo...@gmail.com> wrote:
> > >>
> > >> i think we just need to make an ExecuteSqlRecord processor.
> > >>
> > >> thanks
> > >>
> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mi...@gmail.com>
> wrote:
> > >>>
> > >>> My guess is that it is because Avro is the only record type
> > >>> that can match SQL pretty closely, feature for feature, on data types.
> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com>
> > >>> wrote:
> > >>>>
> > >>>> I've been wondering since I started learning NiFi why ExecuteSQL
> > >>>> processor only returns AVRO formatted data. All community examples
> I've seen
> > >>>> then convert AVRO to json and pretty much all of them then split
> json to
> > >>>> multiple flows.
> > >>>>
> > >>>> I found myself doing the same thing over and over and over again.
> > >>>>
> > >>>> Since everyone is doing it, is there a strong reason why AVRO is
> liked
> > >>>> so much? And why everyone continues doing this 3 step pattern
> rather than
> > >>>> providing users with an option to output json instead and another
> option to
> > >>>> output one flowfile or multiple (one per record).
> > >>>>
> > >>>> thanks
> > >>>> Boris
>

Re: AVRO is the only output format with ExecuteSQL

Posted by Matt Burgess <ma...@apache.org>.
I'm definitely interested in supporting a record-aware version as well
(I wrote the Jira up last year [1] but haven't gotten around to
implementing it); however, I agree with Peter's comment on the Jira.
Since ExecuteSQL is an oft-touched processor, if we had two processors
that only differed in how the output is formatted, it could be harder
to maintain (e.g., bugs would have to be fixed in two places). I think
we should add an optional RecordWriter property to ExecuteSQL, and the
documentation would reflect that if it is not set, the output will be
Avro with an embedded schema, as it has always been. If the RecordWriter
is set, the schema can either be hardcoded, or users can choose "Inherit
Record Schema" even though there is no reader; that would mimic the
current behavior, where the schema is inferred from the database
columns and used for the writer. There is precedent for this pattern
in the SiteToSite reporting tasks.
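
To make that concrete, here is a rough sketch of the kind of optional
property I have in mind. The internal property name and the class holding
it are invented for illustration (this is not from an actual patch), but
the builder calls are the standard NiFi API:

    import org.apache.nifi.components.PropertyDescriptor;
    import org.apache.nifi.serialization.RecordSetWriterFactory;

    public class ExecuteSQLRecordWriterSketch {
        // Optional: when unset, ExecuteSQL would keep its historical
        // Avro-with-embedded-schema output; when set, the configured
        // writer service would control the output format.
        public static final PropertyDescriptor RECORD_WRITER =
                new PropertyDescriptor.Builder()
                        .name("esql-record-writer") // hypothetical name
                        .displayName("Record Writer")
                        .description("If set, query results are written with"
                                + " this writer; the schema is inferred from"
                                + " the database columns.")
                        .identifiesControllerService(RecordSetWriterFactory.class)
                        .required(false)
                        .build();
    }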

To Bryan's point about history, Avro at the time was the most
descriptive of the solutions because it maintains the schema and
data types with the data, unlike JSON, CSV, etc. Also, before the record
readers/writers, as Bryan said, you pretty much had to split,
transform, and merge. We just need to make that processor (and others
with specific input/output formats) "record-aware" for better performance.
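
In case it helps to see why the embedded schema matters, here is a
minimal, self-contained sketch of the idea behind the current behavior.
It is not NiFi's actual JdbcCommon code: it assumes the H2 driver and the
Avro library on the classpath, and it maps every column to a string for
brevity, where the real code maps the SQL types properly:

    import java.io.ByteArrayOutputStream;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class ResultSetToAvroSketch {
        public static void main(String[] args) throws Exception {
            // In-memory H2 database just to make the sketch runnable.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
                 Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE users (id INT, name VARCHAR(50))");
                stmt.execute("INSERT INTO users VALUES (1, 'boris')");

                try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM users")) {
                    ResultSetMetaData meta = rs.getMetaData();

                    // Build an Avro record schema from the result set columns.
                    SchemaBuilder.FieldAssembler<Schema> fields =
                            SchemaBuilder.record("ResultRecord").fields();
                    for (int i = 1; i <= meta.getColumnCount(); i++) {
                        fields = fields.name(meta.getColumnName(i))
                                .type().stringType().noDefault();
                    }
                    Schema schema = fields.endRecord();

                    // DataFileWriter embeds the schema in the output container,
                    // which is why downstream consumers need no external schema.
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(
                            new GenericDatumWriter<GenericRecord>(schema))) {
                        writer.create(schema, out);
                        while (rs.next()) {
                            GenericRecord record = new GenericData.Record(schema);
                            for (int i = 1; i <= meta.getColumnCount(); i++) {
                                record.put(meta.getColumnName(i), rs.getString(i));
                            }
                            writer.append(record);
                        }
                    }
                    System.out.println("Avro bytes (schema + data): " + out.size());
                }
            }
        }
    }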

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-4517
On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bb...@gmail.com> wrote:
>
> I would also add that the pattern of splitting to 1 record per flow
> file was common before the record processors existed, and generally
> this can/should be avoided now in favor of processing/manipulating
> records in place, and keeping them together in large batches.
>
>
>
> On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <ap...@gmail.com> wrote:
> > Careful, that makes too much sense, Joe ;)
> >
> >
> > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <jo...@gmail.com> wrote:
> >>
> >> i think we just need to make an ExecuteSqlRecord processor.
> >>
> >> thanks
> >>
> >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mi...@gmail.com> wrote:
> >>>
> >>> My guess is that it is because Avro is the only record type
> >>> that can match SQL pretty closely, feature for feature, on data types.
> >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com>
> >>> wrote:
> >>>>
> >>>> I've been wondering since I started learning NiFi why ExecuteSQL
> >>>> processor only returns AVRO formatted data. All community examples I've seen
> >>>> then convert AVRO to json and pretty much all of them then split json to
> >>>> multiple flows.
> >>>>
> >>>> I found myself doing the same thing over and over and over again.
> >>>>
> >>>> Since everyone is doing it, is there a strong reason why AVRO is liked
> >>>> so much? And why everyone continues doing this 3 step pattern rather than
> >>>> providing users with an option to output json instead and another option to
> >>>> output one flowfile or multiple (one per record).
> >>>>
> >>>> thanks
> >>>> Boris

Re: AVRO is the only output format with ExecuteSQL

Posted by Bryan Bende <bb...@gmail.com>.
I would also add that the pattern of splitting to 1 record per flow
file was common before the record processors existed, and generally
this can/should be avoided now in favor of processing/manipulating
records in place, and keeping them together in large batches.



On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <ap...@gmail.com> wrote:
> Careful, that makes too much sense, Joe ;)
>
>
> On Tue, Aug 7, 2018, 8:45 AM Joe Witt <jo...@gmail.com> wrote:
>>
>> i think we just need to make an ExecuteSqlRecord processor.
>>
>> thanks
>>
>> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mi...@gmail.com> wrote:
>>>
>>> My guess is that it is because Avro is the only record type
>>> that can match SQL pretty closely, feature for feature, on data types.
>>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>>
>>>> I've been wondering since I started learning NiFi why ExecuteSQL
>>>> processor only returns AVRO formatted data. All community examples I've seen
>>>> then convert AVRO to json and pretty much all of them then split json to
>>>> multiple flows.
>>>>
>>>> I found myself doing the same thing over and over and over again.
>>>>
>>>> Since everyone is doing it, is there a strong reason why AVRO is liked
>>>> so much? And why everyone continues doing this 3 step pattern rather than
>>>> providing users with an option to output json instead and another option to
>>>> output one flowfile or multiple (one per record).
>>>>
>>>> thanks
>>>> Boris

Re: AVRO is the only output format with ExecuteSQL

Posted by Andrew Grande <ap...@gmail.com>.
Careful, that makes too much sense, Joe ;)

On Tue, Aug 7, 2018, 8:45 AM Joe Witt <jo...@gmail.com> wrote:

> i think we just need to make an ExecuteSqlRecord processor.
>
> thanks
>
> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mi...@gmail.com> wrote:
>
>> My guess is that it is because Avro is the only record type
>> that can match SQL pretty closely, feature for feature, on data types.
>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> I've been wondering since I started learning NiFi why ExecuteSQL
>>> processor only returns AVRO formatted data. All community examples I've
>>> seen then convert AVRO to json and pretty much all of them then split
>>> json to multiple flows.
>>>
>>> I found myself doing the same thing over and over and over again.
>>>
>>> Since everyone is doing it, is there a strong reason why AVRO is liked
>>> so much? And why everyone continues doing this 3 step pattern rather than
>>> providing users with an option to output json instead and another option to
>>> output one flowfile or multiple (one per record).
>>>
>>> thanks
>>> Boris
>>>
>>

Re: AVRO is the only output format with ExecuteSQL

Posted by Joe Witt <jo...@gmail.com>.
i think we just need to make an ExecuteSqlRecord processor.

thanks

On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mi...@gmail.com> wrote:

> My guess is that it is because Avro is the only record type
> that can match SQL pretty closely, feature for feature, on data types.
> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>
>> I've been wondering since I started learning NiFi why ExecuteSQL
>> processor only returns AVRO formatted data. All community examples I've
>> seen then convert AVRO to json and pretty much all of them then split
>> json to multiple flows.
>>
>> I found myself doing the same thing over and over and over again.
>>
>> Since everyone is doing it, is there a strong reason why AVRO is liked so
>> much? And why everyone continues doing this 3 step pattern rather than
>> providing users with an option to output json instead and another option to
>> output one flowfile or multiple (one per record).
>>
>> thanks
>> Boris
>>
>

Re: AVRO is the only output format with ExecuteSQL

Posted by Mike Thomsen <mi...@gmail.com>.
My guess is that it is because Avro is the only record type
that can match SQL pretty closely, feature for feature, on data types.
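
For example, here is my own rough sketch of the idea (not NiFi's actual
JdbcCommon mapping): a JDBC-to-Avro conversion can carry exact decimals
and timestamps via Avro logical types, which JSON or CSV would flatten
into strings:

    import java.sql.Types;

    import org.apache.avro.LogicalTypes;
    import org.apache.avro.Schema;

    public class SqlTypeMappingSketch {
        // Illustrative mapping of a few JDBC types to Avro, including the
        // logical types that preserve precision and semantics.
        public static Schema toAvro(int jdbcType, int precision, int scale) {
            switch (jdbcType) {
                case Types.INTEGER:
                    return Schema.create(Schema.Type.INT);
                case Types.BIGINT:
                    return Schema.create(Schema.Type.LONG);
                case Types.DECIMAL:
                case Types.NUMERIC:
                    // Exact decimals survive as bytes plus a decimal logical type.
                    return LogicalTypes.decimal(precision, scale)
                            .addToSchema(Schema.create(Schema.Type.BYTES));
                case Types.TIMESTAMP:
                    // Millisecond timestamps ride on a long.
                    return LogicalTypes.timestampMillis()
                            .addToSchema(Schema.create(Schema.Type.LONG));
                default:
                    return Schema.create(Schema.Type.STRING);
            }
        }

        public static void main(String[] args) {
            System.out.println(toAvro(Types.DECIMAL, 10, 2));  // bytes + decimal(10,2)
            System.out.println(toAvro(Types.TIMESTAMP, 0, 0)); // long + timestamp-millis
        }
    }
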
On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com> wrote:

> I've been wondering since I started learning NiFi why ExecuteSQL processor
> only returns AVRO formatted data. All community examples I've seen then
> convert AVRO to json and pretty much all of them then split json to
> multiple flows.
>
> I found myself doing the same thing over and over and over again.
>
> Since everyone is doing it, is there a strong reason why AVRO is liked so
> much? And why everyone continues doing this 3 step pattern rather than
> providing users with an option to output json instead and another option to
> output one flowfile or multiple (one per record).
>
> thanks
> Boris
>