Posted to dev@nifi.apache.org by "Peter Wicks (pwicks)" <pw...@micron.com> on 2017/05/23 05:13:15 UTC

RE: [EXT] Re: NiFi 1.2.0 Record processors question

I appreciate the clarification as well. I was really confused why my Avro files weren't converting, and this explains it; though I have to say the error messages you run into in this scenario are not clear.

-----Original Message-----
From: Koji Kawamura [mailto:ijokarumawak@gmail.com] 
Sent: Monday, May 22, 2017 9:59 AM
To: dev <de...@nifi.apache.org>
Subject: [EXT] Re: NiFi 1.2.0 Record processors question

I've updated the JIRA description to cover not only embedded Avro schemas but also schemas derived from other sources, such as CSVReader.
https://issues.apache.org/jira/browse/NIFI-3921

Thanks,
Koji

On Sat, May 20, 2017 at 4:14 AM, Joe Gresock <jg...@gmail.com> wrote:
> Yes, both of your examples help explain the use of the CSV header parsing.
>
> I think I have a much better understanding of the new framework now, 
> thanks to Bryan and Matt.  Mission accomplished!
>
> On Fri, May 19, 2017 at 7:04 PM, Bryan Bende <bb...@gmail.com> wrote:
>
>> When a reader produces a record it attaches the schema it used to the 
>> record, but we currently don't have a way for a writer to use that 
>> schema when writing a record, although I think we do want to support 
>> that... something like a "Use Schema in Record" option as a choice in 
>> the 'Schema Access Strategy' of writers.
>>
>> For now, when a processor uses a reader and a writer, and you also 
>> want to read and write with the same schema, then you would still 
>> have to define the same schema for the writer to use even if you had 
>> a CSV reader that inferred the schema from the headers.
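For example, if the CSV reader infers an all-String schema from headers id,name,amount, the writer's 'Schema Text' property would need the equivalent schema spelled out explicitly. A sketch of what that schema text might look like (the record name and column names here are only an illustration, not from the thread):

```json
{
  "type": "record",
  "name": "users",
  "fields": [
    { "name": "id",     "type": "string" },
    { "name": "name",   "type": "string" },
    { "name": "amount", "type": "string" }
  ]
}
```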
>>
>> There are some processors that only use a reader, like 
>> PutDatabaseRecord, where using the CSV header would still be helpful.
>>
>> There are also a lot of cases where you would write with a 
>> different schema than you read with, so using the CSV header for 
>> reading is still helpful in those cases too.
>>
>> Hopefully I am making sense and not confusing things more.
>>
>>
>> On Fri, May 19, 2017 at 1:27 PM, Joe Gresock <jg...@gmail.com> wrote:
>> > Matt,
>> >
>> > Great response, this does help explain a lot.  Reading through your 
>> > post made me realize I didn't understand the AvroSchemaRegistry.  I 
>> > had been thinking it was something that NiFi processors populated, 
>> > but I re-read its
>> > usage description and it does indeed say to use dynamic properties 
>> > for the
>> > schema name / value.  In that case, I can definitely see how this 
>> > is not dynamic in the sense of inferring any schemas on the flow.  
>> > It makes me wonder if there would be a way to populate the schema 
>> > registry from flow files.  When I first glanced at the processors, 
>> > I had assumed that when the
>> > schema was inferred from the CSV headers, it would create an entry 
>> > in the AvroSchemaRegistry, provided you filled in the correct properties.
>> Clearly
>> > this is not how it works.
>> >
>> > Just so I understand, does your first paragraph mean that even if 
>> > you use the CSV headers to determine the schema, you still can't 
>> > use the *Record processors unless you manually register a matching 
>> > schema in the schema registry, or otherwise somehow set the schema 
>> > in an attribute?  In this case, it almost seems like inferring the 
>> > schema from the CSV headers serves
>> > no purpose, and I don't see how NIFI-3921 would alleviate that (it 
>> > only appears to address avro flow files with embedded schema).
>> >
>> > Based on this understanding, I was able to successfully get the 
>> > following flow working:
>> > InferAvroSchema -> QueryRecord.
>> >
>> > QueryRecord uses CSVReader with "Use Schema Text Property" and 
>> > Schema Text
>> > set to ${inferred.avro.schema} (which is populated by the 
>> > InferAvroSchema processor).  It also uses JsonRecordSetWriter with 
>> > a similar configuration.  I could attach a template, but I don't 
>> > know the best way to do that on the list.
>> >
>> > Joe
>> >
>> > On Fri, May 19, 2017 at 4:59 PM, Matt Burgess 
>> > <ma...@apache.org> wrote:
>> >
>> >> Joe,
>> >>
>> >> Using the CSV Headers to determine the schema is currently the 
>> >> only "dynamic" schema strategy, so it will be tricky to use with 
>> >> the other Readers/Writers and associated processors (which require 
>> >> an explicit schema). This should be alleviated with NIFI-3921 [1].  
>> >> For this first release, I believe the approach would be to 
>> >> identify the various schemas for your incoming/outgoing data, 
>> >> create a Schema Registry with all of them, then configure the various Record Readers/Writers to use those.
>> >>
>> >> There were some issues during development related to using the 
>> >> incoming vs outgoing schema for various record operations, if 
>> >> QueryRecord seems to be using the output schema for querying then 
>> >> it is likely a bug. However in this case it might just be that you 
>> >> need an explicit schema for your Writer that matches the input 
>> >> schema (which is inferred from the CSV header). The CSV Header 
>> >> inference currently assumes all fields are Strings, so a nominal 
>> >> schema would have the same number of fields as columns, each with 
>> >> type String. If you don't know the number of columns and/or the 
>> >> column names are dynamic per CSV file, I believe we have a gap here (for now).
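As a rough illustration of what that inference behavior produces, here is a minimal Python sketch (the function name is made up for illustration; this is not NiFi code) that builds the nominal all-String Avro schema from a CSV header line:

```python
import csv
import io
import json

def all_string_avro_schema(header_line, record_name="inferred"):
    """Build the nominal Avro schema that CSV-header inference would
    produce: one field per column, every field typed as a String."""
    # Parse the header with the csv module so quoted column names survive.
    columns = next(csv.reader(io.StringIO(header_line)))
    return {
        "type": "record",
        "name": record_name,
        "fields": [{"name": col, "type": "string"} for col in columns],
    }

print(json.dumps(all_string_avro_schema("id,name,amount"), indent=2))
```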
>> >>
>> >> I thought of maybe having an InferRecordSchema processor that would 
>> >> fill in the avro.text attribute for use in various downstream 
>> >> record readers/writers, but inferring schemas in general is not a 
>> >> trivial task. An easier interim solution might be to have an 
>> >> AddSchemaAsAttribute processor, which takes a Reader to parse the 
>> >> records and determine the schema (whether dynamic or static), and 
>> >> set the avro.text attribute on the original incoming flow file, 
>> >> then transfer the original flow file. This would require two 
>> >> reads, one by AddSchemaAsAttribute and one by the downstream 
>> >> record processor, but it should be fairly easy to implement.  Then 
>> >> again, since new features would go into 1.3.0, hopefully NIFI-3921 
>> >> will be implemented by then, rendering all this moot :)
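That two-read idea could be sketched roughly like this in Python (purely illustrative, not NiFi's API; a naive all-String CSV-header parse stands in here for whatever configured Reader would determine the schema):

```python
import csv
import io
import json

def add_schema_as_attribute(content, attributes):
    """First read: derive a schema from the content, stash it in an
    attribute, and pass the original content through untouched."""
    header = next(csv.reader(io.StringIO(content)))
    schema = {
        "type": "record",
        "name": "inferred",
        "fields": [{"name": col, "type": "string"} for col in header],
    }
    out = dict(attributes)
    out["avro.text"] = json.dumps(schema)
    # The flow file body is transferred unchanged; the downstream record
    # processor performs the second read using the attribute-supplied schema.
    return content, out

content, attrs = add_schema_as_attribute("id,name\n1,joe\n", {})
```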
>> >>
>> >> Regards,
>> >> Matt
>> >>
>> >> [1] https://issues.apache.org/jira/browse/NIFI-3921
>> >>
>> >> On Fri, May 19, 2017 at 12:25 PM, Joe Gresock <jg...@gmail.com> wrote:
>> >> > I've tried a couple different configurations of CSVReader / 
>> >> > JsonRecordSetWriter with the QueryRecord processor, and I don't 
>> >> > think I quite have the usage down yet.
>> >> >
>> >> > Can someone give a basic example configuration in the following 
>> >> > 2 scenarios?  I followed most of Matt Burgess's response to the 
>> >> > post titled
>> >> > "How to use ConvertRecord Processor", but I don't think it tells
>> >> > the whole story.
>> >> >
>> >> > 1) QueryRecord, converting CSV to JSON, using only the CSV 
>> >> > headers to determine the schema.  (I tried selecting Use String 
>> >> > Fields from Header in
>> >> > CSVReader, but the processor really seems to want to use the 
>> >> > JsonRecordSetWriter to determine the schema)
>> >> >
>> >> > 2) QueryRecord, converting CSV to JSON, using a cached avro 
>> >> > schema.  I imagine I need to use InferAvroSchema here, but I'm 
>> >> > not sure how to cache
>> >> > it in the AvroSchemaRegistry.  Also not quite sure how to 
>> >> > configure the
>> >> > properties of each controller service in this case.
>> >> >
>> >> > Any help would be appreciated.
>> >> >
>> >> > Joe
>> >> >
>> >> > --
>> >> > I know what it is to be in need, and I know what it is to have
>> >> > plenty.  I have learned the secret of being content in any and every
>> >> > situation, whether well fed or hungry, whether living in plenty
>> >> > or in want.  I can do all this through him who gives me
>> >> > strength.    *-Philippians 4:12-13*
>> >>
>> >
>> >
>> >
>>