Posted to users@nifi.apache.org by Koji Kawamura <ij...@gmail.com> on 2017/09/12 01:05:38 UTC

Re: QueryDatabaseTable - Schema

Hi Uwe,

I had the same expectation when using QueryDatabaseTable, or any other
processor that produces Avro FlowFiles with an embedded schema, together
with the new record reader/writer controller services.

NiFi now has an "Inherit Record Schema" option for the RecordWriter's
"Schema Access Strategy". It is already merged into the master branch:
https://issues.apache.org/jira/browse/NIFI-3921

Using "Inherit Record Schema", I was able to reuse the Avro schema in the
subsequent flow, which is really useful. You can construct a flow like
the one below:

- QueryDatabaseTable
  - outputs FlowFile with Avro schema embedded
- ConvertRecord
  - AvroReader:
    - "Schema Access Strategy" = "Use Embedded Avro Schema"
  - CSVRecordSetWriter:
    - "Schema Access Strategy" = "Inherit Record Schema"
    - "Schema Write Strategy" = "Set 'avro.schema' Attribute"

This way you don't need to have the schema in a registry, and the
resulting CSV FlowFile carries an 'avro.schema' attribute inherited from
the schema that QueryDatabaseTable created.
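
To make "Avro schema embedded" concrete: an Avro data file stores its
schema as JSON under the 'avro.schema' key of the file header, and that is
the kind of information a reader using the embedded schema relies on.
Below is a minimal Python sketch (not NiFi code) that pulls the schema out
of such a header using only the standard library. The helper names
(read_long, encode_long, embedded_schema, build_header) and the sample
'users' schema are made up for illustration; in practice you would use an
Avro library rather than parse the header by hand.

```python
import io
import json

MAGIC = b"Obj\x01"  # magic bytes that start an Avro object container file


def read_long(stream):
    """Decode one Avro long (zigzag-encoded varint) from a binary stream."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)


def encode_long(value):
    """Encode an integer as an Avro long (used only to build sample data)."""
    n = (value << 1) ^ (value >> 63)  # zigzag encoding
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def read_bytes(stream):
    """Read a length-prefixed byte string."""
    return stream.read(read_long(stream))


def embedded_schema(stream):
    """Return the schema stored in the header metadata of an Avro data file."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the metadata map
            break
        if count < 0:  # negative count: a block byte size follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return json.loads(meta["avro.schema"].decode("utf-8"))


def build_header(schema, codec=b"null"):
    """Build a minimal header for demonstration (hypothetical helper)."""
    entries = (("avro.schema", json.dumps(schema).encode("utf-8")),
               ("avro.codec", codec))
    body = MAGIC + encode_long(len(entries))
    for key, value in entries:
        k = key.encode("utf-8")
        body += encode_long(len(k)) + k + encode_long(len(value)) + value
    return body + encode_long(0) + bytes(16)  # end of map + sync marker


# Sample schema, invented for this example.
schema = {"type": "record", "name": "users",
          "fields": [{"name": "id", "type": "int"},
                     {"name": "name", "type": ["null", "string"]}]}
found = embedded_schema(io.BytesIO(build_header(schema)))
print([f["name"] for f in found["fields"]])  # prints ['id', 'name']
```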

Hope this helps.

Thanks,
Koji

On Tue, Sep 12, 2017 at 5:02 AM, Uwe Geercken <uw...@web.de> wrote:
> Hello,
>
> I was wondering: if the QueryDatabaseTable processor internally creates an
> Avro schema, why is this schema not available as an attribute or saved to
> the registry?
>
> If it were, one could reuse the schema. E.g. if I use the ConvertRecord
> processor and specify an AvroReader as the RecordReader, then this reader
> will take the schema from the FlowFile that the QueryDatabaseTable
> processor creates. But the RecordWriter in ConvertRecord (in my example a
> CSVRecordSetWriter) requires the schema as an attribute or as a reference
> to the schema registry.
>
> I can see there is an ExtractAvroSchema processor, but I don't see a way
> of combining the metadata into e.g. the ConvertRecord processor.
>
> Any help or ideas?
>
> Rgds,
>
> Uwe

Re: QueryDatabaseTable - Schema

Posted by Matt Burgess <ma...@gmail.com>.
One of its limitations had to do with timing: the record-aware components were added after QDT (QueryDatabaseTable). It would be great to have QDT use a record writer; then, depending on the writer, you could choose your schema output strategy as Koji outlined.

I'm not sure if there is a JIRA for this or not (or any of the other earlier processors like ExecuteSQL), but we can certainly go after record-aware versions of them to get all the goodness therein.

Regards,
Matt 


Re: Re: QueryDatabaseTable - Schema

Posted by Koji Kawamura <ij...@gmail.com>.
Hi Uwe,

Answers to Q1 and Q2:

I agree with you; I was confused too when I first tried to understand
NiFi records, schemas, and so on.
To understand them, you need to get familiar with the following concepts.
Grouping the keywords and concepts as follows might help you grasp how
they relate to each other:

- NiFi data models:
  - FlowFile: used to pass data around a NiFi flow. It has String
    key/value pairs ("Attributes") and opaque binary "Content". Some
    serialization formats, such as Avro, can embed the schema within the
    content.
  - Record: represents a data unit consisting of multiple fields; its
    structure is defined by a "Schema". Used inside NiFi components such
    as Processors and Controller Services. It resides on the heap; when
    it is passed to the next component, it is serialized/deserialized by
    a RecordReader and RecordWriter from/to various data formats: CSV,
    JSON, XML, Avro, etc.

- RecordReader
-- How to retrieve the schema of an incoming "FlowFile" (Schema Access Strategy)
---- Embedded schema: read from content serialized as Avro with the
     schema embedded
---- Schema text attribute: the FlowFile carries a String representation
     of the schema
---- Schema Registry: Hortonworks, Confluent, AvroSchema

- RecordWriter
-- How to retrieve the schema of a processed "Record" (Schema Access Strategy)
---- Inherit schema (from the processed Record): useful for components
     like ConvertRecord, because the component already knows the schema
     of the record being processed, so there is no need to retrieve it
---- Schema text attribute
---- Schema Registry
-- How to write the schema of a processed "Record" (Schema Write Strategy)
---- Embedded schema: write the schema within the output FlowFile's content
---- Schema text attribute: put the schema into an output FlowFile attribute
---- Schema Registry: put reference keys into output FlowFile attributes
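
As a rough illustration of how these strategies fit together in
ConvertRecord, here is a small Python model (not NiFi's actual API). The
reader returns the schema along with the records, and the writer simply
inherits that schema and also writes it out as an 'avro.schema' attribute.
The function names and the dict-based FlowFile are hypothetical, and JSON
stands in for the Avro binary content to keep the sketch self-contained.

```python
import csv
import io
import json


def embedded_schema_reader(flowfile):
    """Reader side: the schema is taken from the content itself."""
    payload = json.loads(flowfile["content"])
    return payload["schema"], payload["records"]


def csv_writer_inherit(schema, records):
    """Writer side: inherit the record schema, emit CSV plus an attribute."""
    names = [field["name"] for field in schema["fields"]]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=names, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return {
        # Schema Write Strategy: set the 'avro.schema' attribute
        "attributes": {"avro.schema": json.dumps(schema)},
        "content": out.getvalue(),
    }


def convert_record(flowfile):
    """Model of ConvertRecord: no registry lookup is needed for the writer."""
    schema, records = embedded_schema_reader(flowfile)
    return csv_writer_inherit(schema, records)


# Sample input, invented for this example.
incoming = {
    "attributes": {},
    "content": json.dumps({
        "schema": {"type": "record", "name": "users",
                   "fields": [{"name": "id", "type": "int"},
                              {"name": "name", "type": "string"}]},
        "records": [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}],
    }),
}
result = convert_record(incoming)
print(result["content"])
```

The point of the model is only that the writer never has to look the
schema up: it receives it from the record that the reader produced.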

Answer to Q3:

To debug a schema, using the 'Schema text attribute' Schema Write
Strategy might be the easiest option; you can then see the schema of a
FlowFile in the NiFi UI as a FlowFile attribute.
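
If you copy the attribute value out of the UI, a few lines of Python can
turn it into a readable field list. This is only a convenience sketch: the
function name describe_schema and the sample schema text are made up, and
it assumes the attribute holds a record schema in Avro's JSON format.

```python
import json


def describe_schema(avro_schema_text):
    """Return 'name: type' lines for the fields of a record schema."""
    schema = json.loads(avro_schema_text)
    lines = []
    for field in schema.get("fields", []):
        ftype = field["type"]
        if isinstance(ftype, list):  # union such as ["null", "string"]
            ftype = " | ".join(
                t if isinstance(t, str) else t.get("type", "?") for t in ftype
            )
        elif isinstance(ftype, dict):  # nested or logical type
            ftype = ftype.get("logicalType") or ftype.get("type", "?")
        lines.append(f"{field['name']}: {ftype}")
    return lines


# Sample attribute value, invented for this example.
attribute_value = (
    '{"type": "record", "name": "users", "fields": ['
    '{"name": "id", "type": "int"}, '
    '{"name": "name", "type": ["null", "string"]}]}'
)
for line in describe_schema(attribute_value):
    print(line)
```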

I'm not aware of any component that can write a schema to an external
schema registry at the moment.

Thanks,
Koji

On Wed, Sep 13, 2017 at 3:00 AM, Uwe Geercken <uw...@web.de> wrote:
> Thank you Koji!
>
> That is good news. But I have 3 questions:
>
> 1. You quote Bryan Bende: "When a reader produces a record it attaches the
> schema it used to the record...": What happens here exactly? Is the schema
> attached to the FlowFile? Is it an attribute?
>
> 2. I cannot find an exact definition of what "inherit" means. It may be
> linked to my question above, though. I am a bit puzzled by the use of
> "embedded" versus "inherit". Does it not mean "embedded" in both cases? If
> it really means inherit, where does it inherit from? Or can I choose that?
>
> 3. What if I really do want to save the schema of e.g. a database table or
> file to the registry, maybe as a reference or for debugging? How would I
> do that (I mean: not manually)?
>
>
> From the first look I found NiFi a kick-ass tool. It continues to evolve
> very fast and I use it at work for smaller things. Now I want to start
> using it for more challenging things such as feeding Kafka and maybe also
> Hadoop. So I am experimenting a lot and want to find the best possible
> setup.
>
> Greetings and thanks again.
>
> Uwe