Posted to dev@nifi.apache.org by Emanuel Oliveira <em...@gmail.com> on 2020/02/01 17:32:57 UTC

basic enhancements + incomplete processors + when nifi gets more complicated than it should..

Hi,

Based on recent experience, I found it very hard to implement logic which I
think should exist out of the box; instead it was a slow process of
repeatedly discovering that a property on a processor only works for one
type of data even though the processor supports multiple types, etc.

I would like you all to take a keep-it-simple attitude and imagine how you
would implement a basic scenario such as:

*basic scenario 1 - the following 3 needs should be easy to implement out of
the box:*
CSV (*get schema automatically via header line*) --> *validate a mandatory
subset of fields (presence and data types)* --> *export a subset of fields*
or all of them (but with some of them obfuscated)
problems/workarounds found in 1.9 RC3

*1. processor ValidateRecord*
[1.1] *OK* - allows *getting the schema automatically via the header line*
and checking a *mandatory subset of fields* (presence) via the 3 schema
properties --> suggest renaming the properties to make clear that those at
processor level are the "mandatory check" vs the schema on the reader, which
is the data read schema.
[1.2] *NOK* - does not allow *type validation*. *One could think of using
InferSchema, right? The problem is it only supports JSON.*
[1.3] *NOK* - ignores the writer schema, where one could supply a *subset of
the original fields* (it always exports all original fields) --> add a
property to control exporting all fields (default) or using the writer
schema (with the subset).

*2. processor ConvertRecord*
[2.1] *OK* - the CSVReader is able to *get the schema from the header* -->
maybe improve/add a property to clean up field names (regex search/replace,
so we can strip whitespace and anything else that breaks NiFi processors
and/or that doesn't interest us)
[2.2] *NOK* - missing a *mandatory subset of fields* check.
[2.3] *OK* - does a good job converting between formats, and/or *exporting
all or a subset of fields via the writer schema*.

*3. processor InferAvroSchema*
[3.1] *NOK* - despite the property "Input Content Type" listing CSV and JSON
as inbound data, in reality the property "Number Of Records To Analyze" only
supports JSON. It took us 2 days of debugging to understand the problem: 1
CSV with 4k lines and mostly nulls, "1"s or "2"s, but a few records were
"true" or "false", meaning the Avro data type should have been [null,
string]. But no, as we found out, the type kept being [null, long], with the
doc attribute showing that the 1st data line in the CSV was always used to
determine the field type. This was VERY scary to find out; how can it be
that this was considered fully working as expected? We ended up needing to
add one more processor to convert CSV into JSON so we could get a proper
schema, and even now we are still testing, as it seems all fields came out
as [string] when some columns should be long.

I'm not sure of the best way to raise this, but I'm working at enterprise
level, and believe me, these small but critical nuances are starting to hurt
the mood around NiFi.
I fell in love with NiFi and I like the idea of graphical design of flows
etc., but we really must fix these critical little devils; they are being
called out as NiFi problems at management level.
I know NiFi is open source and it is up to us developers to improve it; I
just want to call attention to making sure that, in the middle of PRs and
JIRA enhancements, we are not forgetting the basic threshold. It doesn't
make sense to release a processor with only 50% of its main goal developed
when the remaining work would be easy and fast to do (e.g. InferAvroSchema).

As I keep experimenting more and more with NiFi, I keep finding that the
level of basic feature quality is below what I think it should be. Better
not to release incomplete processors, at least regarding the core function
of the processor.

I know developers can contribute new code, fixes and enhancements, but is
there any gatekeeper team double-checking the deliverables? At a basic
level, a developer should provide enough unit tests. Again, with
InferAvroSchema being a processor that exports an Avro schema based on
either a CSV or JSON input, there obviously should be a couple of unit tests
with CSVs and JSON containing different data, so we can be sure we get the
proper types in the exported Avro schema, right?

Above I share some ideas, and I have many more from my day-by-day experience
of working with NiFi at enterprise level for more than 1 year now.
Let me know what the best way is to create JIRAs to fix several processors,
in order to allow an inexperienced NiFi client developer to accomplish the
basic flow of:

CSV (*get schema automatically via header line*) --> *validate a mandatory
subset of fields (presence and data types)* --> *export a subset of fields*
or all of them (but with some of them obfuscated)

I challenge anyone to come up with flows to implement this basic flow, test
it, and see what I mean; you will see how incomplete and hard things are,
which should not be the case at all. NiFi should be true Lego: add
processors that say they do XPTO and trust that they will. But we keep
finding a lot of nuances.

I don't mind taking 1 day off work to have a meeting with some of you (I
don't know if there is such a thing as a tech lead on the NiFi project?), as
I think it would be urgent to fix the foundations of some processors. Let me
know.



Best Regards,
*Emanuel Oliveira*

Re: basic enhancements + incomplete processors + when nifi gets more complicated than it should..

Posted by Mike Thomsen <mi...@gmail.com>.
*1. processor ValidateRecord, [1.1] OK - allows getting the schema
automatically via the header line and checking a mandatory subset of fields
(presence) via the 3 schema properties --> suggest renaming the properties
to make clear those at processor level are the "mandatory check" vs the
schema on the reader, which is the data read schema.*

We cannot easily rename fields and things like that because we have a
policy of preferring strong backward compatibility between major releases.
Many of us who are committers and on the PMC have felt this pain too as
we've looked at a feature that's gone in and wanted to make a tweak, but
the goal is to ensure that there are as few gotchas as possible for data
engineers already invested in published features and behaviors. We really
want to produce as little inconvenience there as possible.



*[1.2] NOK - does not allow type validation. One could think of using
InferSchema, right? The problem is it only supports JSON.*


We'd need more information about this particular issue you're having. I
find it hard to believe that ValidateRecord is not capable of validating
types using a supplied Avro schema. If you are referring to the issues with
validating inferred schema fields against your preferred types, when the
original data is a CSV file, I don't think that's something that can be
implemented in a sane way because CSV has no concept of types other than
strings. In fact, trying to guess types with CSV would a) usually be wrong
and b) often result in subtly horrible issues with downstream systems (this
is especially true with more forgiving NoSQL systems).

*[1.3] NOK - ignores the writer schema, where one could supply a subset of
the original fields (it always exports all original fields) --> add a
property to control exporting all fields (default) or using the writer
schema (with the subset).*

I think what is happening here is that ValidateRecord is only using the
reader schema to validate the record because that is the most natural place
for the validation to happen. I agree that it can be unintuitive to allow a
writer with a different schema, but you should be fine if you use a schema
on the reader that matches your intent with respect to the validation.
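
For example, a CSV header-derived schema is all strings, so if you want type
and presence checks, the reader (or the processor's schema properties) can
be given an explicit schema that encodes that intent. Something along these
lines (field names made up, just a sketch):

{
  "type": "record",
  "name": "input_row",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "age", "type": "int" }
  ]
}

With a schema like that, a row whose "age" value is not a valid int should
end up routed to the invalid relationship.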

*2. processor ConvertRecord, [2.1] OK - the CSVReader is able to get the
schema from the header --> maybe improve/add a property to clean up field
names (regex search/replace, so we can strip whitespace and anything else
that breaks NiFi processors and/or that doesn't interest us)*

We have UpdateRecord for this particular use case. In fact, not long ago we
added the trim function for record path operations to enable this use case.
See documentation here:
http://nifi.apache.org/docs/nifi-docs/html/record-path-guide.html
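
For example, to strip stray whitespace from a single field today,
UpdateRecord can be configured roughly like this (the field name is made up,
just a sketch):

Record Reader              = CSVReader (schema taken from the header)
Record Writer              = CSVRecordSetWriter
Replacement Value Strategy = Record Path Value
/first_name                = trim(/first_name)

The dynamic property name is the record path of the field to update, and
with the "Record Path Value" strategy the value expression is evaluated
against each record.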

As a rule of thumb, we are working hard to keep a strong separation of
concerns in the components so that their behavior is simple and
predictable. Mutating records on read from a reader would be a very serious
violation of that principle and cause potentially a lot of grief for other
users.

I will say that in general, I agree a "one stop trim" function for record
sets would be particularly useful. I'll try to draft a Jira ticket to
capture that story because a mass "trim all strings" (with record paths for
exceptions) could be pretty useful.


*[2.2] NOK - missing a mandatory subset of fields check.*

The CSV Reader and Writer are designed to delegate to the schema
registration system, with optional support for hard-coding in a single
schema. In order to keep it simple, I don't think adding a layer on top of
schema validation for enforcement of mandatory fields is a good idea. It's
already really simple to enforce mandatory field presence in an Avro
schema, and doing it there keeps flows simple and flexible.
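
For example, something along these lines (field names made up, just a
sketch): the first field has no "null" branch and no default, so it is
required, while the second is explicitly optional:

{
  "type": "record",
  "name": "customer",
  "fields": [
    { "name": "customer_id", "type": "string" },
    { "name": "email", "type": [ "null", "string" ], "default": null }
  ]
}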

*[2.3] OK - does a good job converting between formats, and/or exporting
all or a subset of fields via the writer schema.*

*3. processor InferAvroSchema, [3.1] NOK - despite the property "Input
Content Type" listing CSV and JSON as inbound data, in reality the property
"Number Of Records To Analyze" only supports JSON. It took us 2 days of
debugging to understand the problem: 1 CSV with 4k lines and mostly nulls,
"1"s or "2"s, but a few records were "true" or "false", meaning the Avro
data type should have been [null, string]. But no, as we found out, the type
kept being [null, long], with the doc attribute showing that the 1st data
line in the CSV was always used to determine the field type. This was VERY
scary to find out; how can it be that this was considered fully working as
expected? We ended up needing to add one more processor to convert CSV into
JSON so we could get a proper schema, and even now we are still testing, as
it seems all fields came out as [string] when some columns should be long.*


I believe this delegates to the Kite SDK, which is a third party project.
Therefore there might not be much we can do unless someone in the community
is willing to do a deep dive there and try to address shortcomings.

On Sun, Feb 2, 2020 at 9:32 PM Mike Thomsen <mi...@gmail.com> wrote:

> Hi Emanuel,
>
> I think you raise some potentially valid issues that are worth looking at
> in more detail. I can say our experience with NiFi is the exact opposite, but
> part of that is that we are a 100% "schema first" shop. Avro is insanely
> easy to learn, and we've gotten junior data engineers up to speed in a
> matter of days producing beta quality data contracts that way.
>

Re: basic enhancements + incomplete processors + when nifi gets more complicated than it should..

Posted by Mike Thomsen <mi...@gmail.com>.
Forgot to mention that I recently wrote and deployed a service that uses
that library to take POJOs and expose a schema to third-party clients, which
they can then use with the Avro APIs to send valid JSON that can be ingested
into our Kafka queues.

On Mon, Feb 3, 2020 at 6:06 PM Mike Thomsen <mi...@gmail.com> wrote:


Re: basic enhancements + incomplete processors + when nifi gets more complicated than it should..

Posted by Mike Thomsen <mi...@gmail.com>.
Hi Emanuel,

> This may look like a simple use case, but it is very hard to implement..
> but please do surprise me with the sequence of processors needed to
> implement what I think *is a great real world example of data quality*

I think this is the root of the problem. Personally, I wouldn't
characterize anything that relies on schema inference over a schema first
design as a "great real world example of data quality" because getting real
data quality takes a lot of hard data engineering work in an enterprise
environment. As the saying goes, there ain't no such thing as a free lunch.

Now, if you want to generate schemas in a robust way, here's one way I know
tends to yield good results:

https://github.com/FasterXML/jackson-dataformats-binary/tree/master/avro

It will take a POJO and generate an Avro schema from it, and since Java is
a fairly strongly typed language you just need to massage things a little
with some annotations to get certain nuances; for example, I think
javax.validation.Nullable will automatically make a field nullable.
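
As a rough sketch of how that library is typically used (the POJO here is
made up):

import com.fasterxml.jackson.dataformat.avro.AvroMapper;
import com.fasterxml.jackson.dataformat.avro.AvroSchema;
import com.fasterxml.jackson.dataformat.avro.schema.AvroSchemaGenerator;

public class SchemaFromPojo {

    // Plain POJO; Jackson derives the Avro field types from the Java types.
    public static class Customer {
        public String customerId;
        public String email;
        public long createdAt;
    }

    public static void main(String[] args) throws Exception {
        AvroMapper mapper = new AvroMapper();
        AvroSchemaGenerator gen = new AvroSchemaGenerator();
        // Walk the POJO's structure and build the corresponding Avro schema.
        mapper.acceptJsonFormatVisitor(Customer.class, gen);
        AvroSchema schema = gen.getGeneratedSchema();
        // Print the schema as JSON, ready to paste into a record reader/writer.
        System.out.println(schema.getAvroSchema().toString(true));
    }
}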

> On Mon 3 Feb 2020, 13:50 Mike Thomsen, <mi...@gmail.com> wrote:
>
> > One thing I should mention is that schema inference is simply not capable
> > of exploiting Avro's field aliasing. That's an incredibly powerful feature
> > that allows you to reconcile data sets without writing a single line of
> > code. For example, I wrote a schema last year that uses aliases to
> > reconcile 9 different CSV data sets into a common model without writing
> > one line of code. This is all it takes:
> >
> > {
> >   "name": "first_name",
> >   "type": "string",
> >   "aliases": [ "FirstName", "First Name", "FIRST_NAME", "fname", "fName" ]
> > }
> >
> > That one manual line just reconciled 5 fields into a common model.

Re: basic enhancements + incomplete processors + when nifi gets more complicated than it should..

Posted by Emanuel Oliveira <em...@gmail.com>.
Thanks Pierre, it seems I already had an account:
username: emanueol



Best Regards,
*Emanuel Oliveira*



On Mon, Feb 3, 2020 at 7:34 PM Pierre Villard <pi...@gmail.com>
wrote:


Re: basic enhancements + incomplete processors + when nifi gets more complicated than it should..

Posted by Pierre Villard <pi...@gmail.com>.
Hi Emanuel,

Just wanted to answer your questions regarding JIRA. You can create an
account on the Apache JIRA [1] and open JIRAs on the NiFi project [2]. Once
you have created an account and logged in to JIRA once, you can share your
login with us, and we can grant you the "contributor" role, which gives you
the right to assign a JIRA to yourself if you want. But there are no
specific requirements to create JIRAs and/or comment on existing JIRAs.

[1] https://issues.apache.org/jira/secure/Signup!default.jspa
[2] https://issues.apache.org/jira/projects/NIFI


On Mon, Feb 3, 2020 at 1:59 PM Emanuel Oliveira <em...@gmail.com>
wrote:


Re: basic enhancements + incomplete processors + when nifi gets more complicated than it should..

Posted by Emanuel Oliveira <em...@gmail.com>.
Hi Mike,

Let me summarize, as I see my long post is not getting across the clean,
simple message I intended:
*processor InferAvroSchema*:
- should infer types by analysing the data in the CSV. The property "Input
Content Type" lists CSV and JSON, but in reality the property "Number Of
Records To Analyze" only works with JSON. With CSV, every type comes out as
a string. It is not hard to detect whether a field contains only digits or
also alphanumerics; only timestamps might need an extra property to help
with the format (or, out of the box, just detect timestamps as well). See
the sketch below for the kind of schema I would expect.
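
As a minimal sketch of what I mean (the record and field names here are
made up, not from our real data): a column that only ever contains digits
can safely be inferred as a long, while a column that also contains
alphanumeric values should fall back to a nullable string:

{
  "type": "record",
  "name": "example_csv",
  "fields": [
    { "name": "flag_column", "type": [ "null", "string" ], "default": null },
    { "name": "amount",      "type": [ "null", "long" ],   "default": null }
  ]
}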

*Mandatory subset of fields verification:*
ValidateRecord offers 3 optional schema properties (separate from the
reader and writer) to supply an Avro schema that validates a mandatory
subset of fields - but ConvertRecord doesn't allow this.
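
For example, a minimal schema naming only the mandatory columns could be
supplied there (just a sketch; the field names are hypothetical), the idea
being that presence of just those fields is enforced regardless of the
other columns in the file:

{
  "type": "record",
  "name": "mandatory_fields",
  "fields": [
    { "name": "customer_id", "type": "string" },
    { "name": "event_date",  "type": "string" }
  ]
}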


Finally, I would like to request your suggestion for the following use case
(the same one we struggled with):
- given 1 CSV with a header line listing 100 fields, we want to:
--- validate the mandatory fields (just 1 or 2 fields).
--- automatically create an Avro schema based on the data lines.
--- export Avro like this:
------ some fields obfuscated + the remaining fields not obfuscated (or the
other way around: some fields not obfuscated + the remaining fields
obfuscated). And of course the header line stays in line with the final
field order.

This may look like a simple use case, but it is very hard to implement.
Please do surprise me with the sequence of processors needed to implement
what I think is a great real-world example of data quality (mandatory
fields + partial obfuscation + export to a different format with just a
subset of the fields, where some are obfuscated and others not). A rough
sketch of what I have in mind follows.
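
The closest direction I can imagine (only a sketch, not a tested flow, and
the field names below are made up) is to let the record writer's schema
define the exported subset, as ConvertRecord already honours, and then mask
the sensitive fields with something like UpdateRecord (a dynamic property
keyed by a RecordPath such as /ssn, replaced with a fixed mask value). The
writer schema for the subset might look like:

{
  "type": "record",
  "name": "export_subset",
  "fields": [
    { "name": "customer_id", "type": "string" },
    { "name": "event_date",  "type": "string" },
    { "name": "ssn",         "type": "string" }
  ]
}

Even if that works, it still takes a chain of processors and controller
services to express what feels like one logical step, which is exactly my
point.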

Thanks, and I hope this is more clear; I'm sure it will help more dev teams.

Cheers,
Emanuel





Re: basic enhancements + incomplete processors + when nifi gets more complicated than it should..

Posted by Mike Thomsen <mi...@gmail.com>.
One thing I should mention is that schema inference is simply not capable
of exploiting Avro's field aliasing. That's an incredibly powerful feature
that allows you to reconcile data sets without writing a single line of
code. For example, I wrote a schema last year that uses aliases to
reconcile 9 different CSV data sets into a common model without writing one
line of code. This is all it takes:

{
  "name": "first_name",
  "type": "string",
  "aliases": [ "FirstName", "First Name", "FIRST_NAME", "fname", "fName" ]
}

That one manually written field entry just reconciled 5 variant column
names into a common model.
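
For context, here is roughly how that composes at the record level; this is
just an illustrative sketch with made-up record and field names, not the
actual schema from that project:

{
  "type": "record",
  "name": "person",
  "fields": [
    {
      "name": "first_name",
      "type": "string",
      "aliases": [ "FirstName", "FIRST_NAME", "fname" ]
    },
    {
      "name": "last_name",
      "type": "string",
      "aliases": [ "LastName", "LAST_NAME", "lname" ]
    }
  ]
}

During schema resolution, Avro rewrites the writer's field names using the
aliases declared in the reader's schema, so any of those variant column
names lands on the canonical field, which is what makes the reconciliation
free.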


Re: basic enhancements + incomplete processors + when nifi gets more complicated than it should..

Posted by Mike Thomsen <mi...@gmail.com>.
Hi Emanuel,

I think you raise some potentially valid issues that are worth looking at
in more detail. I can say our experience with NiFi is the exact opposite, but
part of that is that we are a 100% "schema first" shop. Avro is insanely
easy to learn, and we've gotten junior data engineers up to speed in a
matter of days producing beta quality data contracts that way.


Re: basic enhancements + incomplete processors + when nifi gets more complicated than it should..

Posted by Martin Ebert <ma...@gmx.de>.
Hi Emanuel,
I can well understand what you mean, but I see less of a need to revise the
processors mentioned here and more of a need to finally offer a standard
Delta Lake processor. Such a processor would cover exactly your use cases:
- Schema Evolution
- Schema Enforcement
- Out of the box: audit history and Time Travel (data versioning)
...
On the roadmap there are still many features that would probably be of
interest to you, too. On YouTube you can find dozens of videos about them.

More on this:
https://delta.io/
And the competitors show how easy it can be: https://youtu.be/VLd_qOrKrTI
If something like this were offered out of the box in NiFi 1.12, maybe
even better than in the video, it would be another very good argument for
NiFi.

Currently, I use this concept by running all Delta operations on Databricks
in notebooks and orchestrating the runs in NiFi. That is a bit of
unnecessary overhead.

Best,
Martin


Re: basic enhancements + incomplete processors + when nifi gets more complicated than it should..

Posted by Emanuel Oliveira <em...@gmail.com>.
Hi Otto,

I can do that, OK. What do I need? I guess I need to become a contributor
in order to have a login in Jira?
Can you please share, step by step, what I need to do? I have only
subscribed to the user and dev mailing lists.
I'm interested in improving NiFi as a power user, and I guess I can now
move to the next step: getting access to JIRA and opening tickets once the
community, in discussion on the user/dev mailing lists, agrees they make
sense.
Thanks.

Best Regards,
*Emanuel Oliveira*





Re: basic enhancements + incomplete processors + when nifi gets more complicated than it should..

Posted by Martin Ebert <ma...@gmx.de>.
Take a look.
https://issues.apache.org/jira/browse/NIFI-6976


Re: basic enhancements + incomplete processors + when nifi gets more complicated than it should..

Posted by Otto Fowler <ot...@gmail.com>.
I hope you entered Jira issues with your great feedback!



