Posted to dev@nifi.apache.org by Mark Payne <ma...@hotmail.com> on 2016/09/04 00:50:31 UTC

Re: Looking for feedback on my WIP Design

Hey Peter,

Finally getting a chance to take a look at some of this stuff. Very cool that you're jumping in on this! I think
this has the potential to really enable a lot of great new possibilities very quickly and easily. Matt B. and I
discussed this concept a bit on the mailing list a while back, but didn't get into great detail. He may have
some great thoughts that he can contribute, as well. I jotted down some thoughts that I had on the design
as I looked over it...

* I think we should not really think of these as record conversions, but rather as two distinct concepts -
record parsing and record serialization. This opens up a lot more possibilities, as we could use a Record
Parser service for a more generic PutHBase processor, for instance. Likewise, we could use the Record
Serialization service with a source processor like ExecuteSQL. And then a processor that wants to convert
from one format to another could use both a parser and a serializer. I don't think the RecordConversionService
is really needed, as I think it makes more sense to just have a single processor to do conversion, and these
service APIs would be used by that processor.
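As a rough sketch of the split (all names and signatures here are made up for illustration, not an actual NiFi API), a conversion processor would simply compose one of each:

```java
import java.io.*;

// Hypothetical sketch only -- these are not real NiFi interfaces.
interface Record {
    Object getValue(String fieldName);
}

interface RecordParser {
    /** Returns the next record from the stream, or null at end of input. */
    Record nextRecord(BufferedReader in) throws IOException;
}

interface RecordSerializer {
    void write(Record record, Writer out) throws IOException;
}

// A toy single-field, line-oriented implementation, just to show how a
// generic conversion processor could pair any parser with any serializer.
class LineParser implements RecordParser {
    public Record nextRecord(BufferedReader in) throws IOException {
        final String line = in.readLine();
        if (line == null) return null;
        return fieldName -> "value".equals(fieldName) ? line : null;
    }
}

class UpperCaseLineSerializer implements RecordSerializer {
    public void write(Record record, Writer out) throws IOException {
        out.write(String.valueOf(record.getValue("value")).toUpperCase());
        out.write('\n');
    }
}
```

A conversion processor would then just loop: parse a record, hand it to the serializer, repeat until the parser returns null.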

* Rather than having a RecordSet that exposes a Map<String, Field> getSchema() method and a lot of "getSupportsXYZ()" types
of methods, I think we should instead have a "Schema getSchema()" method and move the description of what is supported
into the Schema object.
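That is, something along these lines (field and capability names are purely illustrative):

```java
import java.util.*;

// Sketch: the capability flags move off of RecordSet and onto the Schema.
// Not a real NiFi class; shape is illustrative only.
final class Schema {
    private final List<String> fieldNames;
    private final boolean supportsNesting;  // example "what is supported" flag

    Schema(List<String> fieldNames, boolean supportsNesting) {
        this.fieldNames = List.copyOf(fieldNames);
        this.supportsNesting = supportsNesting;
    }

    List<String> getFieldNames()       { return fieldNames; }
    boolean isNestingSupported()       { return supportsNesting; }
    int indexOf(String fieldName)      { return fieldNames.indexOf(fieldName); }
}
```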

* I have written record parsing libraries before, and while the natural tendency is to create a structure such as Record that has
a bunch of Field objects, this can actually cripple performance when running on large volumes of records. For instance, if you
were running just 50 MB/sec of incoming data, and let's say that you have fields that average 100 bytes each, you're talking half
a million Field objects per second that get created and have to be garbage collected. And these are fairly conservative numbers -
you could easily reach many million Field objects per second. So the approach that I would recommend is, instead of having Record
return a bunch of Field objects, to have accessor methods on Record such as:

getString(int fieldIndex);
getString(String fieldName);
getInt(int fieldIndex);
getInt(String fieldName);
getType(int fieldIndex);
getName(int fieldIndex);

* I would also expect that calling getString(1) would convert the first field to a String by using String.valueOf or something like that.
I would expect some sort of inferred conversion, but the impl you have here just does a type-cast, which can lead to a lot of redundant
code being written by the clients of the library.
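Putting those last two points together, a record backed by a plain Object[] (no per-field wrapper objects) with inferred conversions might look roughly like this. This is a sketch, not a proposed class; the names are made up:

```java
// Sketch: an array-backed record that avoids per-field wrapper objects
// and converts on access rather than type-casting. Not a real NiFi class.
final class ArrayRecord {
    private final String[] names;
    private final Object[] values;  // one array per record, no Field objects

    ArrayRecord(String[] names, Object[] values) {
        this.names = names;
        this.values = values;
    }

    String getString(int fieldIndex) {
        Object v = values[fieldIndex];
        return v == null ? null : String.valueOf(v);  // inferred conversion, not a cast
    }

    long getLong(int fieldIndex) {
        Object v = values[fieldIndex];
        if (v instanceof Number) {
            return ((Number) v).longValue();  // widens an int instead of ClassCastException
        }
        return Long.parseLong(String.valueOf(v));
    }

    int indexOf(String fieldName) {
        for (int i = 0; i < names.length; i++) {
            if (names[i].equals(fieldName)) return i;
        }
        throw new IllegalArgumentException("No such field: " + fieldName);
    }

    String getString(String fieldName) { return getString(indexOf(fieldName)); }
    long getLong(String fieldName)     { return getLong(indexOf(fieldName)); }
}
```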

* If Record is going to have a 'getUnderlyingObject' method that returns an Object, it may make more sense to instead make Record generic
and return an object of type T?
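I.e., something like this (sketch only):

```java
// Sketch: making the record generic over its backing object, so callers
// get back a typed reference instead of a bare Object.
final class TypedRecord<T> {
    private final T underlying;

    TypedRecord(T underlying) { this.underlying = underlying; }

    T getUnderlyingObject() { return underlying; }
}
```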

I know this is quite a bit of feedback. Hopefully not too overwhelming :) There's a whole lot of good stuff here, and I think this is going
to turn out to be an awesome addition to NiFi. Very excited about it!


-Mark



> On Aug 23, 2016, at 1:00 PM, Peter Wicks (pwicks) <pw...@micron.com> wrote:
> 
> Matt,
> 
> I've re-architected it a bit.  I'll admit that I haven't tested this new design, but it makes sense; I wanted to get your feedback.  Let me know if this somewhat lines up with your vision.
> 
> https://github.com/patricker/nifi/tree/RowProcessorService
> 
> Then we can have cool processors like `RecordSetConverter` that can convert between arbitrary types, and `RecordSetToSQL` etc...  Could be really cool.
> 
> --Peter
> 
> -----Original Message-----
> From: Peter Wicks (pwicks) 
> Sent: Monday, August 22, 2016 8:57 PM
> To: 'dev@nifi.apache.org' <de...@nifi.apache.org>
> Subject: RE: Looking for feedback on my WIP Design
> 
> Matt,
> 
> When you put it that way it sounds like we'll be taking over the world by breakfast...
> 
> It sounds like this has a lot of other potential uses, such as a generic conversion processor instead of all these one-off a-to-b conversion processors that are floating around. I hadn't really thought that far, but I can see you went there straight away.
> 
> I'll put some more work into it along the lines you outlined and check back in.
> 
> Thanks,
>  Peter
> 
> -----Original Message-----
> From: Matt Burgess [mailto:mattyb149@gmail.com] 
> Sent: Monday, August 22, 2016 7:58 PM
> To: dev@nifi.apache.org
> Subject: Re: Looking for feedback on my WIP Design
> 
> Peter,
> 
> First of all, great work! I couldn't find an Apache Jira for this, but I remember seeing something in the dev email list about perhaps having a ControllerService for arbitrary conversions.
> 
> I took a look at the commit; first things first, looks good for the use case, thanks much!  A handful of notes:
> 
> - As you know, the critical part is your "RowConverter".  To be most useful, we should make sure (as Java has done, over time, with its supported SQL types) that we can refer to structured types in some common language.  So "startNewFile()" might be better as "newRecord()", and addRow() might be better expressed as a NiFi-defined interface.
> ResultSets could have a simple wrapper implementing the generic interface that would let the code consume it the same as any other object.
> 
> - For a common language, perhaps we could refer to structured data as "records" and individual fields (also perhaps nested) as "fields".
> 
> - With a common NiFi-defined API for getting records and fields, we could implement all the converters you mention without any explicit dependency on an implementation NAR/JAR.
> 
> - We should avoid the explicit conversion of input format to any intermediate format; in other words, the interface should follow the Adapter or Facade pattern, and should convert-on-read.
> 
> - I'm sure I've forgotten some of the formats, but I'll list the ones I remember for discussion with respect to conversions:
> 
> 1) XML
> 2) JSON
> 3) CSV (and TSV and other text-based delimited files)
> 4) SQL ResultSet
> 5) CQL ResultSet (Cassandra)
> 6) Elasticsearch result set
> 7) Avro
> 8) Parquet
> 
> If we can cover these with a single interface and 8(-ish) implementations, then I think we're in good shape for world domination
> :) Not being sarcastic, I'm saying let's make it happen!
> 
> Regards,
> Matt
> 
> On Mon, Aug 22, 2016 at 9:32 PM, Peter Wicks (pwicks) <pw...@micron.com> wrote:
>> I'm working on a change to QueryDatabaseTable (and eventually would apply to ExecuteSQL) to allow users to choose the output format, so something besides just Avro.  I'm planning to put in a ticket soon for my work.
>> 
>> Goals:
>> 
>> -          If this update goes out no user should be affected, as defaults will work the way they have before.
>> 
>> -          Don't want to muck up dependencies of standard processors with lots of other libraries to support multiple formats. As such I have implemented it using Controller Services to make the converters pluggable.
>> 
>> -          Would like to support these output formats to start:
>> 
>> o   Avro (This one is already in my commit, as I copy/pasted the code over; but the logic has now been moved to an appropriate library, which I like)
>> 
>> o   ORC (Would be implemented in the HIVE library)
>> 
>> o   JSON (No idea where I would put this processor, unless it's in a new module)
>> 
>> o   Delimited (No idea where I would put this processor, unless it's in a new module)
>> 
>> Here is my commit: https://github.com/patricker/nifi/commit/01a79804f60b6be0f86499949712cd9118fb4f7f
>> 
>> I'd appreciate feedback on design/implementation.  I have the guts in there of how I was planning to implement this.
>> 
>> Thanks,
>>  Peter


RE: Looking for feedback on my WIP Design

Posted by "Peter Wicks (pwicks)" <pw...@micron.com>.
Mark,

It doesn't have everything discussed, but take a look at my updated version (ignore ORC).

https://github.com/patricker/nifi/tree/RowProcessorService

--Peter

-----Original Message-----
From: Mark Payne [mailto:markap14@hotmail.com] 
Sent: Tuesday, September 06, 2016 12:47 PM
To: dev@nifi.apache.org
Subject: Re: Looking for feedback on my WIP Design

Hey Peter,

Thanks for the quick reply!

So the way that you're outlining the RecordConversionService here, I believe the idea is that this service is more of a "factory" for creating a parser or a serializer? Is that accurate? If so, I'm okay with that. The RecordConverter name sounds to me like something you would give a JSON blob and it'll spit out an XML blob or something like that, though...
it may just be the terminology that I'm not understanding. Maybe some JavaDocs will clear that all up once we understand the direction that we're heading better.

But if that is the idea, to have this work as a factory, I would recommend we also have an isRecordConverterSupported() and
isRecordSetSupported() type of method, as I can definitely see someone implementing some sort of parsing functionality but not providing the ability to write data out in a specific format (for example, read metadata from a PDF file but no ability to write arbitrary data to a PDF or something like that).

My suggestion of using String.valueOf was to account for the case when the value that you're obtaining is not actually a String.
If, for instance, I want to call getString() and the value is actually a Long, then String.valueOf() would return a String representation.
Attempting to cast would throw a ClassCastException. Similarly, if I wanted to use getLong() and the value was an int, we should certainly be able to convert the int to a long, but attempting to cast it explicitly will throw ClassCastException, and this can lead to some really odd problems when debugging.

Now, if the intent is to always use getObject() then I would recommend that we get rid of getString(), getInteger(), etc. altogether and allow the user to explicitly cast it themselves if they need it cast. But I think it would be fairly common to want to convert these value types, so it seems reasonable to me to include this conversion in the lib itself.

I do like the getUnderlyingObject() actually. Imagine the case where we want to run some filtering logic. We may want to get in a record, check if it passes the filter, and if so, pass on the original object to the serializer.
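Loosely, that filtering case could look like this (class and method names made up for illustration):

```java
import java.util.*;
import java.util.function.Predicate;

// Sketch: a record wraps its original backing object; filtering inspects
// the parsed fields, but what flows on to the serializer is the untouched
// underlying object. Names are illustrative only, not a real NiFi API.
final class SimpleRecord {
    private final Object underlying;
    private final Map<String, Object> fields;

    SimpleRecord(Object underlying, Map<String, Object> fields) {
        this.underlying = underlying;
        this.fields = fields;
    }

    Object getValue(String name)  { return fields.get(name); }
    Object getUnderlyingObject()  { return underlying; }
}

final class FilterStep {
    static List<Object> passThrough(List<SimpleRecord> records,
                                    Predicate<SimpleRecord> filter) {
        List<Object> toSerialize = new ArrayList<>();
        for (SimpleRecord r : records) {
            if (filter.test(r)) {
                // Pass the original object on, not a re-serialized copy.
                toSerialize.add(r.getUnderlyingObject());
            }
        }
        return toSerialize;
    }
}
```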

Thanks
-Mark

> On Sep 6, 2016, at 1:38 PM, Peter Wicks (pwicks) <pw...@micron.com> wrote:
> 
> Mark,
> 
> Thanks for the overwhelming feedback :)
> 
> I mostly agree with what you've written.  My plan has been to allow ad hoc use of the parsers/serializers.  I haven't pushed to GitHub in a bit so I don't 100% remember what's out there, but right now my ControllerService is set up so that you can use it for either/both:
> 
> public interface RecordConversionService extends ControllerService {
>     RecordConverter getRecordConverter(ProcessContext context) throws ProcessException;
>     RecordSet getRecordSet(FlowFile fIn, ProcessSession session);
> }
> 
> Maybe it just needs a rename? Maybe just call it `RecordService`.  If you think it makes more sense from a usability/API standpoint I have no problem with splitting this up into two ControllerServices (parse/serialize).
> 
> On Schemas and Fields, I completely agree.  It felt a little off to me, and doing it your way makes more sense, and also aligns with how other implementations do it (ResultSet/Avro/etc...).
> 
> You did slightly confuse me when you brought up `String.valueOf`, since String.valueOf just calls Object.toString internally, and calling toString on an Object that is already a String returns the same object reference (no conversion of any kind); so at that point String.valueOf(o) and (String)o appear equivalent.
> 
> Also, while value types like Integer do have a `valueOf` method, they generally accept Strings, so if you take that route and convert the type to String and then back to the actual object type you risk losing data due to Java default String formatting (hypothetically). Whereas if we know it's already of that Object type under the hood, we can cast directly.  Can you be a bit more specific on how you'd like to see data type conversions handled, and what the issue is with direct casts?  I think all of the automated interfaces will just use getObject anyways; the other methods are more for custom processor development.
> 
> On getUnderlyingObject, I don't think having it generic really helps anyone.  I wasn't going to include this getter at all, but when I was working on my original code I was working with QueryDatabaseTable and it needs to pass the ResultSet object to the Max Value collector, so I needed an easy way to access it.  If you need the underlying object internally (for this implementation specific parser/serializer), you will probably just reference the typed field copy or an implementation specific getter. If you are accessing it externally (custom processor), you'll be using interfaces and so you'd have to know exactly what type the underlying object was to begin with for it to help you, or put in custom logic based on the name of the converter.
> 
> Thanks,
>  Peter
> 


Re: Looking for feedback on my WIP Design

Posted by Mark Payne <ma...@hotmail.com>.
Hey Peter,

Thanks for the quick reply!

So the way that you're outlining the RecordConversionService here, I believe the idea is that this service
is more of a "factory" for creating a parser or a serializer? Is that accurate? If so I'm okay with that. The RecordConverter
name sounds to me like something you would give a JSON blob and it'll split out an XML blob or something like that, though...
it may just be the terminology that I'm not understanding. Maybe some JavaDocs will clear that all up once we understand
the direction that we're heading better.

But if that is the idea, to have this work as a factory, I would recommend we also have an isRecordConverterSupported() and
isRecordSetSupported() type of method, as I can definitely see someone implementing some sort of parsing functionality but
not providing the ability to write data out in a specific format (for example, read metadata from a PDF file but no ability to write
arbitrary data to a PDF or something like that).

My suggestion of using String.valueOf was to account for the case when the value that you're obtaining is not actually a String.
If, for instance, I want to call getString() and the value is actually a Long, then String.valueOf() would return a String representation.
Attempting to cast would throw a ClassCastException. Similarly, if I wanted to use getLong() and the value was an int, we should certainly
be able to convert the int to a long, but attempting to cast it explicitly will throw ClassCastException, and this can lead to some really
odd problems when debugging.

Now, if the intent is to always use getObject() then I would recommend that we get rid of the getString(), getInteger(), etc. all together
and allow the user to explicitly cast it themselves if they need it cast. But I think it would be fairly common to want to convert these value
types, so it seems reasonable to me to include this conversion in the lib itself.

I do like the getUnderlyingObject() actually. Imagine the case where we want to run some filtering logic. We may want to get in
a record, check if it passes the filter, and if so, pass on the original object to the serializer.

Thanks
-Mark

> On Sep 6, 2016, at 1:38 PM, Peter Wicks (pwicks) <pw...@micron.com> wrote:
> 
> Mark,
> 
> Thanks for the overwhelming feedback :)
> 
> I mostly agree with what you've written.  My plan has been to allow adhoc use of the parsers/serializers.  I haven't pushed to Github in a bit so I don't 100% remember what's out there, but right now my ControllerService is setup so that you can use it for either/both:
> 
> public interface RecordConversionService extends ControllerService {
>    RecordConverter getRecordConverter(ProcessContext context)  throws ProcessException;
>    RecordSet getRecordSet(FlowFile fIn, ProcessSession session);
> }
> 
> Maybe it just needs a rename? Maybe just call it `RecordService`.  If you think it makes more sense from a usability/API standpoint I have no problem with splitting this up into two ControllerServices (parse/serialize).
> 
> On Schemas and Fields, I completely agree.  It felt a little off to me, and doing it your way makes more sense, and also aligns with how other implementations do it (ResultSet/Avro/etc...).
> 
> You did slightly confuse me when you brought up `String.valueOf`, since String.valueOf just calls Object.toString internally, and calling toString on an Object that is already a String returns the same object reference (no conversion of any kind); so at that point String.valueOf(o) and (String)o appear equivalent.
> 
> Also, while value types like Integer do have a `valueOf` method, they generally accept Strings, so if you take that route and convert the type to String and then back to the actual object type you risk losing data due to Java default String formatting (hypothetically). Whereas if we know it's already of that Object type under the hood we can cast directly.  Can you be a bit more concise on how you'd like to see data type conversions and what the issue is with direct casts?  I think all of the automated interfaces will just use getObject anyways, the other methods are more for custom processor development.
> 
> On getUnderlyingObject, I don't think having it generic really helps anyone.  I wasn't going to include this getter at all, but when I was working on my original code I was working with QueryDatabaseTable and it needs to pass the ResultSet object to the Max Value collector, so I needed an easy way to access it.  If you need the underlying object internally (for this implementation specific parser/serializer), you will probably just reference the typed field copy or an implementation specific getter. If you are accessing it externally (custom processor), you'll be using interfaces and so you'd have to know exactly what type the underlying object was to begin with for it to help you, or put in custom logic based on the name of the converter.
> 
> Thanks,
>  Peter
> 
> -----Original Message-----
> From: Mark Payne [mailto:markap14@hotmail.com] 
> Sent: Saturday, September 03, 2016 6:51 PM
> To: dev@nifi.apache.org
> Subject: Re: Looking for feedback on my WIP Design
> 
> Hey Peter,
> 
> Finally getting a chance to take a look at some of this stuff. Very cool that you're jumping in on this! I think this has the potential to really enable a lot of great new possibilities very quickly and easily. Matt B. and I discussed this concept a while back on the mailing list a bit but didn't get into great detail. He may have some great thoughts that he can contribute, as well. I jotted down some thoughts that I had on the design as I looked over it...
> 
> * I think we should not really think of these are record conversions, but rather two distinct concepts - record parsing and record serialization. This opens up a lot more possibilities, as we could use a Record Parser service for a more generic PutHBase processors, for instance. Likely, we could use the Record Serialization service with a source processor like ExecuteSQL. And then a processor that wants to convert from one format to another could use both a parser and a serializer. I don't think the RecordConversionService is really needed, as I think it makes more sense to just have a single processor to do conversion and these service API's would be used by this processor.
> 
> * Rather than having a RecordSet that contains a Map<String, Field> getSchema() and a lot of "getSupportsXYZ()" types of methods, I think we should instead have a "Schema getSchema()" method and move the description of what is supported into the Schema object.
> 
> * I have written record parsing libraries before, and while the natural tendency is to create a structure such as Record that has a bunch of Field objects, this can actually cripple performance when running on large volumes of records. For instance, if you were running just 50 MB/sec. of incoming data, and lets say that you have fields that average 100 bytes each, you're talking half a million Field objects per second that get created and have to be garbage collected. And these are fairly conservative numbers - you could easily reach many million Field objects per second. So the approach that I would recommend is instead of having Record return a bunch of Field objects, instead have accessor methods on Record such as:
> 
> getString(int fieldIndex);
> getString(String fieldName);
> getInt(int fieldIndex);
> getInt(String fieldName);
> getType(int fieldIndex);
> getName(int fieldIndex);
> 
> * I would also expect that calling getString(1) would convert the first field to a String by using String.vlaueOf or something like that.
> I would expect some sort of inferred conversion, but the impl you have here just does a type-cast, which can lead to a lot of redundant code being used by the clients of the library.
> 
> * If Record is going to have a 'getUnderlyingObject' method that returns an Object, it may make more sense to instead make Record generic and return an object of type T?
> 
> I know this is quite a bit of feedback. Hopefully not too overwhelming :) There's a whole lot of good stuff here, and I think this is going to turn out to be an awesome addition to NiFi. Very excited about it!
> 
> 
> -Mark
> 
> 
> 
>> On Aug 23, 2016, at 1:00 PM, Peter Wicks (pwicks) <pw...@micron.com> wrote:
>> 
>> Matt,
>> 
>> I've re-architected it a bit.  I'll admit that I haven't tested this new design, but it makes sense, wanted to get your feedback.  Let me know if this somewhat lines up with your vision.
>> 
>> https://github.com/patricker/nifi/tree/RowProcessorService
>> 
>> Then we can have cool processors like `RecordSetConverter` that can convert between arbitrary types, and `RecordSetToSQL` etc...  Could be really cool.
>> 
>> --Peter
>> 
>> -----Original Message-----
>> From: Peter Wicks (pwicks)
>> Sent: Monday, August 22, 2016 8:57 PM
>> To: 'dev@nifi.apache.org' <de...@nifi.apache.org>
>> Subject: RE: Looking for feedback on my WIP Design
>> 
>> Matt,
>> 
>> When you put it that way it sounds like we'll be taking over the world by breakfast...
>> 
>> It sounds like this has a lot of other potential uses, such as a generic conversion processor instead of all these one-off a-to-b conversion processors that are floating around. I hadn't really thought that far, but I can see you went there straight away.
>> 
>> I'll put some more work into it along the lines you outlined and check back in.
>> 
>> Thanks,
>> Peter
>> 
>> -----Original Message-----
>> From: Matt Burgess [mailto:mattyb149@gmail.com]
>> Sent: Monday, August 22, 2016 7:58 PM
>> To: dev@nifi.apache.org
>> Subject: Re: Looking for feedback on my WIP Design
>> 
>> Peter,
>> 
>> First of all, great work! I couldn't find an Apache Jira for this, but I remember seeing something in the dev email list about perhaps having a ControllerService for arbitrary conversions.
>> 
>> I took a look at the commit; first things first, looks good for the use case, thanks much!  A handful of notes:
>> 
>> - As you know, the critical part is your "RowConverter".  To be most 
>> useful, we should make sure (like Java has, over time, with their 
>> supported SQL types), we can refer to structured types in some common 
>> language.  So "startNewFile()" might be better as "newRecord()", and
>> addRow() might be better expressed as a NiFi-defined interface.
>> ResultSets could have a simple wrapper implementing the generic interface that would let the code consume it the same as any other object.
>> 
>> - For a common language, perhaps we could refer to structured data as "records" and individual fields (also perhaps nested) as "fields".
>> 
>> - With a common NiFi-defined API for getting records and fields, we could implement all the converters you mention without any explicit dependency on an implementation NAR/JAR.
>> 
>> - We should avoid the explicit conversion of input format to any intermediate format; in other words, the interface should follow the Adapter or Facade pattern, and should convert-on-read.
>> 
>> - I'm sure I've forgotten some of the differences between formats, but I'll list them for discussion with respect to conversions:
>> 
>> 1) XML
>> 2) JSON
>> 3) CSV (and TSV and other text-based delimited files)
>> 4) SQL ResultSet
>> 5) CQL ResultSet (Cassandra)
>> 6) Elasticsearch result set
>> 7) Avro
>> 8) Parquet
>> 
>> If we can cover these with a single interface and 8(-ish) 
>> implementations, then I think we're in good shape for world domination
>> :) Not being sarcastic, I'm saying let's make it happen!
>> 
>> Regards,
>> Matt
>> 
>> On Mon, Aug 22, 2016 at 9:32 PM, Peter Wicks (pwicks) <pw...@micron.com> wrote:
>>> I'm working on a change to QueryDatabaseTable (and eventually would apply to ExecuteSQL) to allow users to choose the output format, so something besides just Avro.  I'm planning to put in a ticket soon for my work.
>>> 
>>> Goals:
>>> 
>>> -          If this update goes out no user should be affected, as defaults will work the way they have before.
>>> 
>>> -          Don't want to muck up dependencies of standard processors with lots of other libraries to support multiple formats. As such I have implemented it using Controller Services to make the converters pluggable.
>>> 
>>> -          Would like to support these output formats to start:
>>> 
>>> o   Avro (This one is already in my commit, as I copy/pasted the code over; but the logic has now been moved to an appropriate library, which I like)
>>> 
>>> o   ORC (Would be implemented in the HIVE library)
>>> 
>>> o   JSON (No idea where I would put this processor, unless it's in a new module)
>>> 
>>> o   Delimited (No idea where I would put this processor, unless it's in a new module)
>>> 
>>> Here is my commit: 
>>> https://github.com/patricker/nifi/commit/01a79804f60b6be0f86499949712
>>> c
>>> d9118fb4f7f
>>> 
>>> I'd appreciate feedback on design/implementation.  I have the guts in there of how I was planning to implement this.
>>> 
>>> Thanks,
>>> Peter
> 


RE: Looking for feedback on my WIP Design

Posted by "Peter Wicks (pwicks)" <pw...@micron.com>.
Mark,

Thanks for the overwhelming feedback :)

I mostly agree with what you've written.  My plan has been to allow ad-hoc use of the parsers/serializers.  I haven't pushed to GitHub in a bit so I don't 100% remember what's out there, but right now my ControllerService is set up so that you can use it for either or both:

public interface RecordConversionService extends ControllerService {
    RecordConverter getRecordConverter(ProcessContext context) throws ProcessException;
    RecordSet getRecordSet(FlowFile fIn, ProcessSession session);
}

Maybe it just needs a rename? Maybe just call it `RecordService`.  If you think it makes more sense from a usability/API standpoint I have no problem with splitting this up into two ControllerServices (parse/serialize).
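For discussion's sake, here is a rough sketch of what the two-service split might look like, with plain java.io streams and Map-based rows standing in for NiFi's FlowFile/ProcessSession plumbing and whatever Record abstraction we land on. All of the names below are hypothetical, not from the actual commit:

```java
import java.io.*;
import java.util.*;

// Hypothetical shape of the parse/serialize split: one service reads records
// from a stream, the other writes them back out. Plain streams stand in for
// FlowFile/ProcessSession here.
interface RecordParserService /* extends ControllerService */ {
    Iterator<Map<String, Object>> parse(InputStream in) throws IOException;
}

interface RecordSerializerService /* extends ControllerService */ {
    void serialize(Iterator<Map<String, Object>> records, OutputStream out) throws IOException;
}

// A generic format-conversion processor would then just wire the two together,
// never knowing the concrete formats on either side.
class ConvertRecordSketch {
    static void convert(RecordParserService parser, RecordSerializerService serializer,
                        InputStream in, OutputStream out) throws IOException {
        serializer.serialize(parser.parse(in), out);
    }
}
```

With that shape, PutHBase-style sinks would only depend on the parser interface and ExecuteSQL-style sources only on the serializer, which matches the decoupling Mark describes.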

On Schemas and Fields, I completely agree.  It felt a little off to me, and doing it your way makes more sense, and also aligns with how other implementations do it (ResultSet/Avro/etc...).

You did slightly confuse me when you brought up `String.valueOf`, since String.valueOf just calls Object.toString internally, and calling toString on an Object that is already a String returns the same object reference (no conversion of any kind); so at that point String.valueOf(o) and (String)o appear equivalent.

Also, while value types like Integer do have a `valueOf` method, they generally accept Strings, so if you take that route and convert the value to a String and then back to the actual object type, you risk (hypothetically) losing data to Java's default String formatting. Whereas if we know it's already of that Object type under the hood, we can cast directly.  Can you be a bit more concrete about how you'd like to see data type conversions handled, and what the issue is with direct casts?  I think all of the automated interfaces will just use getObject anyway; the other methods are more for custom processor development.

On getUnderlyingObject, I don't think having it generic really helps anyone.  I wasn't going to include this getter at all, but when I was working on my original code with QueryDatabaseTable, it needed to pass the ResultSet object to the Max Value collector, so I needed an easy way to access it.  If you need the underlying object internally (for that implementation-specific parser/serializer), you will probably just reference the typed field copy or an implementation-specific getter. If you are accessing it externally (custom processor), you'll be using interfaces, so you'd have to know exactly what type the underlying object was to begin with for it to help you, or put in custom logic based on the name of the converter.

Thanks,
  Peter

-----Original Message-----
From: Mark Payne [mailto:markap14@hotmail.com] 
Sent: Saturday, September 03, 2016 6:51 PM
To: dev@nifi.apache.org
Subject: Re: Looking for feedback on my WIP Design

Hey Peter,

Finally getting a chance to take a look at some of this stuff. Very cool that you're jumping in on this! I think this has the potential to really enable a lot of great new possibilities very quickly and easily. Matt B. and I discussed this concept a while back on the mailing list a bit but didn't get into great detail. He may have some great thoughts that he can contribute, as well. I jotted down some thoughts that I had on the design as I looked over it...

* I think we should not really think of these as record conversions, but rather two distinct concepts - record parsing and record serialization. This opens up a lot more possibilities, as we could use a Record Parser service for a more generic PutHBase processor, for instance. Likely, we could use the Record Serialization service with a source processor like ExecuteSQL. And then a processor that wants to convert from one format to another could use both a parser and a serializer. I don't think the RecordConversionService is really needed, as I think it makes more sense to just have a single processor to do conversion and these service APIs would be used by this processor.

* Rather than having a RecordSet that contains a Map<String, Field> getSchema() and a lot of "getSupportsXYZ()" types of methods, I think we should instead have a "Schema getSchema()" method and move the description of what is supported into the Schema object.
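To make that concrete, a Schema object along the lines Mark describes might carry the field metadata plus the format's capabilities, replacing the "getSupportsXYZ()" methods. This is an illustrative sketch only; every name here is hypothetical:

```java
import java.util.*;

// One possible shape for the Schema object: field names and types, plus an
// enumeration of what the format supports, instead of getSupportsXYZ()
// methods scattered across the RecordSet.
class Schema {
    enum Capability { NESTED_RECORDS, NULL_VALUES, SCHEMA_EVOLUTION }

    private final List<String> fieldNames;
    private final Map<String, Class<?>> fieldTypes;
    private final EnumSet<Capability> capabilities;

    Schema(List<String> names, Map<String, Class<?>> types, EnumSet<Capability> caps) {
        this.fieldNames = names;
        this.fieldTypes = types;
        this.capabilities = caps;
    }

    List<String> getFieldNames() { return fieldNames; }
    Class<?> getType(String field) { return fieldTypes.get(field); }
    boolean supports(Capability c) { return capabilities.contains(c); }
}
```

A CSV schema would simply omit NESTED_RECORDS from its capability set, and callers could check `schema.supports(...)` before attempting an operation.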

* I have written record parsing libraries before, and while the natural tendency is to create a structure such as Record that has a bunch of Field objects, this can actually cripple performance when running on large volumes of records. For instance, if you were running just 50 MB/sec of incoming data, and let's say that you have fields that average 100 bytes each, you're talking half a million Field objects per second that get created and have to be garbage collected. And these are fairly conservative numbers - you could easily reach many million Field objects per second. So the approach that I would recommend is instead of having Record return a bunch of Field objects, instead have accessor methods on Record such as:

getString(int fieldIndex);
getString(String fieldName);
getInt(int fieldIndex);
getInt(String fieldName);
getType(int fieldIndex);
getName(int fieldIndex);

* I would also expect that calling getString(1) would convert the first field to a String by using String.valueOf or something like that.
I would expect some sort of inferred conversion, but the impl you have here just does a type-cast, which can lead to a lot of redundant code being used by the clients of the library.
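Putting the two ideas together (accessors instead of per-field objects, plus inferred conversion in the typed getters), a minimal sketch might look like the following. The class and method names are illustrative, not from the actual commit:

```java
// Accessor-style Record: values live in a single Object[] (which a parser can
// reuse row to row, so no Field objects are allocated per value), and the
// typed getters coerce rather than blindly cast.
class ArrayBackedRecord {
    private final String[] names;
    private final Object[] values;   // reusable row to row

    ArrayBackedRecord(String[] names, Object[] values) {
        this.names = names;
        this.values = values;
    }

    int indexOf(String name) {
        for (int i = 0; i < names.length; i++) {
            if (names[i].equals(name)) return i;
        }
        throw new IllegalArgumentException("No such field: " + name);
    }

    String getName(int i) { return names[i]; }

    // Inferred conversion: any non-null value can become a String.
    String getString(int i) {
        Object v = values[i];
        return v == null ? null : String.valueOf(v);
    }
    String getString(String name) { return getString(indexOf(name)); }

    // Numbers are narrowed directly; Strings are parsed; anything else fails fast.
    int getInt(int i) {
        Object v = values[i];
        if (v instanceof Number) return ((Number) v).intValue();
        if (v instanceof String) return Integer.parseInt((String) v);
        throw new IllegalArgumentException("Cannot convert " + v + " to int");
    }
    int getInt(String name) { return getInt(indexOf(name)); }
}
```

The coercion in getInt is what spares clients the redundant instanceof/cast boilerplate a bare type-cast pushes onto them.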

* If Record is going to have a 'getUnderlyingObject' method that returns an Object, it may make more sense to instead make Record generic and return an object of type T?
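The generic alternative would look something like this sketch (names hypothetical), letting a caller such as QueryDatabaseTable get a typed handle back without an unchecked cast:

```java
// Sketch of a Record parameterized on its underlying source object. A
// JDBC-backed record would be TypedRecord<java.sql.ResultSet>, so a Max
// Value collector could receive the ResultSet without casting from Object.
class TypedRecord<T> {
    private final T underlying;

    TypedRecord(T underlying) { this.underlying = underlying; }

    T getUnderlyingObject() { return underlying; }
}
```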

I know this is quite a bit of feedback. Hopefully not too overwhelming :) There's a whole lot of good stuff here, and I think this is going to turn out to be an awesome addition to NiFi. Very excited about it!


-Mark



> On Aug 23, 2016, at 1:00 PM, Peter Wicks (pwicks) <pw...@micron.com> wrote:
> 
> Matt,
> 
> I've re-architected it a bit.  I'll admit that I haven't tested this new design, but it makes sense, wanted to get your feedback.  Let me know if this somewhat lines up with your vision.
> 
> https://github.com/patricker/nifi/tree/RowProcessorService
> 
> Then we can have cool processors like `RecordSetConverter` that can convert between arbitrary types, and `RecordSetToSQL` etc...  Could be really cool.
> 
> --Peter
> 
> -----Original Message-----
> From: Peter Wicks (pwicks)
> Sent: Monday, August 22, 2016 8:57 PM
> To: 'dev@nifi.apache.org' <de...@nifi.apache.org>
> Subject: RE: Looking for feedback on my WIP Design
> 
> Matt,
> 
> When you put it that way it sounds like we'll be taking over the world by breakfast...
> 
> It sounds like this has a lot of other potential uses, such as a generic conversion processor instead of all these one-off a-to-b conversion processors that are floating around. I hadn't really thought that far, but I can see you went there straight away.
> 
> I'll put some more work into it along the lines you outlined and check back in.
> 
> Thanks,
>  Peter
> 
> -----Original Message-----
> From: Matt Burgess [mailto:mattyb149@gmail.com]
> Sent: Monday, August 22, 2016 7:58 PM
> To: dev@nifi.apache.org
> Subject: Re: Looking for feedback on my WIP Design
> 
> Peter,
> 
> First of all, great work! I couldn't find an Apache Jira for this, but I remember seeing something in the dev email list about perhaps having a ControllerService for arbitrary conversions.
> 
> I took a look at the commit; first things first, looks good for the use case, thanks much!  A handful of notes:
> 
> - As you know, the critical part is your "RowConverter".  To be most 
> useful, we should make sure (like Java has done, over time, with its 
> supported SQL types) that we can refer to structured types in some common 
> language.  So "startNewFile()" might be better as "newRecord()", and
> addRow() might be better expressed as a NiFi-defined interface.
> ResultSets could have a simple wrapper implementing the generic interface that would let the code consume it the same as any other object.
> 
> - For a common language, perhaps we could refer to structured data as "records" and individual fields (also perhaps nested) as "fields".
> 
> - With a common NiFi-defined API for getting records and fields, we could implement all the converters you mention without any explicit dependency on an implementation NAR/JAR.
> 
> - We should avoid the explicit conversion of input format to any intermediate format; in other words, the interface should follow the Adapter or Facade pattern, and should convert-on-read.
> 
> - I'm sure I've forgotten some of the differences between formats, but I'll list them for discussion with respect to conversions:
> 
> 1) XML
> 2) JSON
> 3) CSV (and TSV and other text-based delimited files)
> 4) SQL ResultSet
> 5) CQL ResultSet (Cassandra)
> 6) Elasticsearch result set
> 7) Avro
> 8) Parquet
> 
> If we can cover these with a single interface and 8(-ish) 
> implementations, then I think we're in good shape for world domination
> :) Not being sarcastic, I'm saying let's make it happen!
> 
> Regards,
> Matt
> 
> On Mon, Aug 22, 2016 at 9:32 PM, Peter Wicks (pwicks) <pw...@micron.com> wrote:
>> I'm working on a change to QueryDatabaseTable (and eventually would apply to ExecuteSQL) to allow users to choose the output format, so something besides just Avro.  I'm planning to put in a ticket soon for my work.
>> 
>> Goals:
>> 
>> -          If this update goes out no user should be affected, as defaults will work the way they have before.
>> 
>> -          Don't want to muck up dependencies of standard processors with lots of other libraries to support multiple formats. As such I have implemented it using Controller Services to make the converters pluggable.
>> 
>> -          Would like to support these output formats to start:
>> 
>> o   Avro (This one is already in my commit, as I copy/pasted the code over; but the logic has now been moved to an appropriate library, which I like)
>> 
>> o   ORC (Would be implemented in the HIVE library)
>> 
>> o   JSON (No idea where I would put this processor, unless it's in a new module)
>> 
>> o   Delimited (No idea where I would put this processor, unless it's in a new module)
>> 
>> Here is my commit: 
>> https://github.com/patricker/nifi/commit/01a79804f60b6be0f86499949712
>> c
>> d9118fb4f7f
>> 
>> I'd appreciate feedback on design/implementation.  I have the guts in there of how I was planning to implement this.
>> 
>> Thanks,
>>  Peter