Posted to dev@nifi.apache.org by Bryan Bende <bb...@gmail.com> on 2015/08/13 03:09:24 UTC

[DISCUSS] Feature proposal: First-class Avro Support

All,

Given how popular Avro has become, I'm very interested in making progress
on providing first-class support within NiFi. I took a stab at filling in
some of the requirements on the Feature Proposal Wiki page [1] and wanted
to get feedback from everyone to see if these ideas are headed in the right
direction.

Are there any major features missing from that list? Any other
recommendations?

I'm also proposing that we create a new Avro bundle to capture the
functionality that is decided upon, and we can consider whether any of the
existing Avro-specific functionality in the Kite bundle could eventually
move to the Avro bundle. If anyone feels strongly about this, or has an
alternative recommendation, let us know.

[1]
https://cwiki.apache.org/confluence/display/NIFI/First-class+Avro+Support

Thanks,

Bryan

Re: [DISCUSS] Feature proposal: First-class Avro Support

Posted by Ricky Saltzer <ri...@cloudera.com>.
Thanks for the write-up; the proposal looks great. I imagine moving towards
a standardization on using Avro for record-centric FlowFiles will make
writing future processors a lot easier, too. All of the proposed
requirements seem doable without too much trouble.

On Thu, Aug 13, 2015 at 10:26 AM, Mark Payne <ma...@hotmail.com> wrote:

> Bryan,
>
> The wiki looks good to me. A lot of stuff going on there, but I think it's
> all good. Would love to see
> a lot more support for Avro!
>
> Thanks
> -Mark



-- 
Ricky Saltzer
http://www.cloudera.com

Re: [DISCUSS] Feature proposal: First-class Avro Support

Posted by Joey Echeverria <jo...@gmail.com>.
That functionality sounds awesome! I'd be a big +1. I can't speak for
Ryan, but I think it would make sense to move the format conversion
processors to an avro nar.

On Thu, Aug 13, 2015 at 10:26 AM, Mark Payne <ma...@hotmail.com> wrote:
> Bryan,
>
> The wiki looks good to me. A lot of stuff going on there, but I think it's all good. Would love to see
> a lot more support for Avro!
>
> Thanks
> -Mark

RE: [DISCUSS] Feature proposal: First-class Avro Support

Posted by Mark Payne <ma...@hotmail.com>.
Bryan,

The wiki looks good to me. A lot of stuff going on there, but I think it's all good. Would love to see
a lot more support for Avro!

Thanks
-Mark


Re: [DISCUSS] Feature proposal: First-class Avro Support

Posted by Toivo Adams <to...@gmail.com>.
Hello

An Avro Content Viewer would be great.
Does anyone have time to implement this?

Thanks
toivo





Re: [DISCUSS] Feature proposal: First-class Avro Support

Posted by Bryan Bende <bb...@gmail.com>.
Ryan,

Thanks for the feedback and suggestions! We will definitely factor all of
this into the design, and when I get a chance I will update the Wiki page
accordingly.

Thanks,

Bryan

On Sat, Aug 15, 2015 at 5:45 PM, Ryan Blue <bl...@cloudera.com> wrote:

> Thanks for putting this together, Bryan!
>
> I have a few thoughts and observations about the proposal:
>
> * Conversion to Avro is an easier problem than conversion from Avro. Item
> #2 is to convert from Avro to other formats like CSV, but that isn't
> possible for some Avro schemas. For example, Avro supports nested lists and
> maps that have no good representation in CSV so we'll have to be careful
> about that conversion. It is possible for a lot of data and is definitely
> valuable, though.
>
> * For #3, converting Avro records, I'd also like to see the addition of
> transformation expressions. For example, I might have a timestamp in
> seconds that I need to convert to the Avro timestamp-millis type by
> multiplying the value by 1000.
>
> * There are a few systems like Flume that use Avro serialization for
> individual records, without the Avro file container. This complicates
> behavior a bit. Your suggestion to have merge/split is great, but we should
> plan on having a couple of scenarios for it:
>   - Merge/split between files and bare records with schema header
>   - Merge/split Avro files to produce different sized files
>
> * The "extract fingerprint" processor could be more general and populate a
> few fields from the Avro header:
>   - Schema definition (full, not fp)
>   - Schema fingerprint
>   - Schema root record name (if schema is a record)
>   - Key/value metadata, like compression codec
>
> * It looks like #7, evaluate paths, and #8, update records, are intended
> for the case where the content is a bare Avro record. I'm not sure that
> evaluating paths would work for Avro files.
>
> * For the update records processor, this is really similar to the
> processor to convert between Avro schemas, #3. I suggest merging the two
> and making it easy to work with either a file or a record via record-level
> callback. This would be useful elsewhere as well. Maybe tell the difference
> between file and record by checking for the filename attribute?
>
> On the subject of where these processors go, I'm not attached to them
> being in the Kite bundle. It would probably be better to separate that out.
> However, there are some specific features in the Kite bundle that I think
> are really valuable:
>   - Use a schema file from a HDFS path (requires Hadoop config)
>   - Use the current schema of a dataset/table
>
> Those make it possible to update a table schema, then have that change
> propagate to the conversion in NiFi. So if I start receiving a new field in
> my JSON data, I just update a table definition and then the processor picks
> up the change either automatically or with a restart.
>
> The other complication is that the libraries for reading JSON and CSV (and
> from an InputFormat if you are interested) are in Kite, so you'll have a
> Kite dependency either way. We can look at separating the support into
> stand-alone Kite modules or moving it into the upstream Avro project.
>
> Overall, this looks like a great addition!
>
> rb
>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>

Re: [DISCUSS] Feature proposal: First-class Avro Support

Posted by Ryan Blue <bl...@cloudera.com>.
Thanks for putting this together, Bryan!

I have a few thoughts and observations about the proposal:

* Conversion to Avro is an easier problem than conversion from Avro. 
Item #2 is to convert from Avro to other formats like CSV, but that 
isn't possible for some Avro schemas. For example, Avro supports nested 
lists and maps that have no good representation in CSV so we'll have to 
be careful about that conversion. It is possible for a lot of data and 
is definitely valuable, though.

* For #3, converting Avro records, I'd also like to see the addition of 
transformation expressions. For example, I might have a timestamp in 
seconds that I need to convert to the Avro timestamp-millis type by 
multiplying the value by 1000.

* There are a few systems like Flume that use Avro serialization for 
individual records, without the Avro file container. This complicates 
behavior a bit. Your suggestion to have merge/split is great, but we 
should plan on having a couple of scenarios for it:
   - Merge/split between files and bare records with schema header
   - Merge/split Avro files to produce different sized files

* The "extract fingerprint" processor could be more general and populate 
a few fields from the Avro header:
   - Schema definition (full, not fp)
   - Schema fingerprint
   - Schema root record name (if schema is a record)
   - Key/value metadata, like compression codec

* It looks like #7, evaluate paths, and #8, update records, are intended 
for the case where the content is a bare Avro record. I'm not sure that 
evaluating paths would work for Avro files.

* For the update records processor, this is really similar to the 
processor to convert between Avro schemas, #3. I suggest merging the two 
and making it easy to work with either a file or a record via 
record-level callback. This would be useful elsewhere as well. Maybe 
tell the difference between file and record by checking for the filename 
attribute?
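
To make the record-level callback idea concrete, together with the
transformation expressions mentioned under #3, here is a minimal sketch. It
is written in Python rather than NiFi's Java and uses plain dicts in place of
Avro records. The names transform_records and seconds_to_millis are
hypothetical, not part of NiFi, Kite, or Avro.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Record = Dict[str, Any]
Transform = Callable[[Any], Any]


def seconds_to_millis(value: int) -> int:
    """Convert epoch seconds to epoch milliseconds (the unit of timestamp-millis)."""
    return value * 1000


def transform_records(
    records: Iterable[Record],
    transforms: Dict[str, Transform],
) -> Iterator[Record]:
    """Apply per-field transforms to each record, one record at a time.

    The same callback works whether `records` holds a single bare record
    or every record read from a file.
    """
    for record in records:
        updated = dict(record)  # copy so the input record is left untouched
        for field, fn in transforms.items():
            # Skip absent fields and nulls instead of failing on them.
            if field in updated and updated[field] is not None:
                updated[field] = fn(updated[field])
        yield updated


if __name__ == "__main__":
    rows = [{"id": 1, "created": 1439428164}, {"id": 2, "created": None}]
    for row in transform_records(rows, {"created": seconds_to_millis}):
        print(row)
```

Because the caller only supplies an iterable of records, the file-versus-record
question stays outside the transform itself: a bare record can be passed as a
one-element list, and a file reader can pass its record iterator.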

On the subject of where these processors go, I'm not attached to them 
being in the Kite bundle. It would probably be better to separate that 
out. However, there are some specific features in the Kite bundle that I 
think are really valuable:
   - Use a schema file from an HDFS path (requires Hadoop config)
   - Use the current schema of a dataset/table

Those make it possible to update a table schema, then have that change 
propagate to the conversion in NiFi. So if I start receiving a new field 
in my JSON data, I just update a table definition and then the processor 
picks up the change either automatically or with a restart.
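
As a rough illustration of that propagation, the sketch below lets the schema
alone decide which JSON fields survive conversion, so swapping in an updated
schema changes the output without touching the conversion code. It is plain
Python over the schema's JSON form, with no type checking or Avro encoding.
The function name project_to_schema and the example User schemas are
assumptions made for illustration, not Kite or NiFi behavior.

```python
import json
from typing import Any, Dict


def project_to_schema(json_text: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields the record schema declares.

    Fields missing from the input fall back to the schema default when one
    is declared. Fields the schema does not know about are dropped.
    """
    data = json.loads(json_text)
    result: Dict[str, Any] = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in data:
            result[name] = data[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError(f"missing required field: {name}")
    return result


if __name__ == "__main__":
    schema_v1 = {
        "type": "record",
        "name": "User",
        "fields": [{"name": "id", "type": "long"}],
    }
    # Updated table definition: one new optional field.
    schema_v2 = {
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "email", "type": ["null", "string"], "default": None},
        ],
    }
    line = '{"id": 7, "email": "a@example.com"}'
    print(project_to_schema(line, schema_v1))  # new field ignored under the old schema
    print(project_to_schema(line, schema_v2))  # new field kept once the schema is updated
```

In a processor, the schema argument would be whatever the current table
definition resolves to when the record is converted. Whether that lookup
happens per FlowFile or only on restart is the design question raised above.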

The other complication is that the libraries for reading JSON and CSV 
(and from an InputFormat if you are interested) are in Kite, so you'll 
have a Kite dependency either way. We can look at separating the support 
into stand-alone Kite modules or moving it into the upstream Avro project.

Overall, this looks like a great addition!

rb


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: [DISCUSS] Feature proposal: First-class Avro Support

Posted by Joe Witt <jo...@gmail.com>.
Bryan

Very cool.  I would definitely want to get Ryan Blue's thoughts as the
originator of that nar but I think it makes sense to move non-kite
specific items to this potentially new 'nifi-avro-nar' and then rename
the current nar to 'nifi-kite-nar' with it retaining the kite specific
pieces.

The items you've put out sound great.  I don't think they all have to
happen at once either.

Thanks
Joe
