Posted to users@nifi.apache.org by Mark Petronic <ma...@gmail.com> on 2015/11/29 15:36:37 UTC

Concept of async event-driven processors but not the experimental scheduling policy version

I know there is an experimental event-driven scheduling policy. This is not
that. Has anyone considered a pattern where processors might emit events
based on certain criteria and other processors might ONLY act on those
events? I'm just thinking out loud at the moment and wanted to see if anyone
else has pondered this concept. Here's my use case.
Consider the RouteText processor feeding into a PutHDFS. RouteText is
grouping records on yyyymmdd values using a regex because I want to
partition files into HDFS directories by yyyymmdd and then use Hive to
query the data. PutHDFS simply uses the RouteText.group attribute to create
the year/month/day HDFS directory structure like:

/stats/year=2015/month=11/day=28/the_stats_file_000001.csv
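As a rough sketch of that grouping step (the record layout and regex here are my assumptions for illustration, not taken from the actual flow), deriving the partition path from a yyyymmdd field might look like:

```python
import re

# Hypothetical record format: the yyyymmdd value appears as a bare
# 8-digit field somewhere in the CSV line.
pattern = re.compile(r"\b(\d{4})(\d{2})(\d{2})\b")

def partition_path(record, base="/stats"):
    """Derive the Hive-style partition directory from a record's date field."""
    m = pattern.search(record)
    if m is None:
        return None  # no recognizable date; RouteText would send this elsewhere
    year, month, day = m.groups()
    return f"{base}/year={year}/month={month}/day={day}"
```

So a record like "host1,20151128,42" would land under /stats/year=2015/month=11/day=28, matching the path above.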

However, I need to ALSO run a Hive HQL command to "alter table X add
partition Y" to allow Hive to see this new partition of data. So, the
"event" part of this concept would be some way to instruct PutHDFS to emit
an event ONLY when it actually creates a new directory. There could be an
"event" relationship that could feed some other processor, like ExecuteSQL,
that would then add the partition ONLY when this event occurs. It would NOT
act on any flowfiles - only on events. There will be lots of files being
put that will fall into an ALREADY existing directory since PutHDFS only
has to create the directory structure ONCE. I only need to know about that
ONE event so as to run the HQL command ONCE. I know there are ways to
"wire" this up using existing processors, like use ExecuteStreamCommand to
run a script that checks if the directory exists, and, if not, create it
and run the SQL processor to run a SQL command against Hive to build the
partition and then let PutHDFS do it's thing. But that means running this
script on EVERY flow file which is a waster of resources.
Only PutHDFS really knows when it needs to create the directory ONCE. I was
just wondering if there was any thought of building in some asyc event
handling?

Anyway, just an idea.

Mark

Re: Concept of async event-driven processors but not the experimental scheduling policy version

Posted by Mark Petronic <ma...@gmail.com>.
Mark, I was thinking exactly the same thing about adding a dir.created
type attribute to key off of and thereby continuing to work within the current
framework. Thanks for your thoughts.

On Sun, Nov 29, 2015 at 10:51 AM, Mark Payne <ma...@hotmail.com> wrote:

> Mark,
>
> I can't say that I've ever really given thought to an explicit "Eventing
> Model" like the one that you are describing.
> However, the way that you are describing it is really just a new
> relationship on the PutHDFS processor, so it would
> be a very processor-specific change.
>
> Rather than adding a new "Event" type of relationship, though, I would
> lean more toward creating an attribute on the existing
> FlowFile that is routed to 'success'. So an attribute named, say,
> "hdfs.directory.created" could be added to the FlowFile
> and if you care about that information, you can route the FlowFiles to a
> RouteOnAttribute processor, which
> is able to route the FlowFile accordingly.
>
> Does this give you what you need?
>
> Thanks
> -Mark
>
>
>
> > On Nov 29, 2015, at 9:36 AM, Mark Petronic <ma...@gmail.com>
> wrote:
> >
> > I know there is an experimental event-driven scheduling policy. This is
> not that. Has anyone considered a pattern where processors might emit
> events based on certain criteria and other processors might ONLY act on
> those events? I'm just thinking out loud on this thought at the moment and
> just wanted to see if anyone else had pondered this concept. Here's my use
> case. Consider the RouteText processor feeding into a PutHDFS. RouteText is
> grouping records on yyyymmdd values using a regex because I want to
> partition files into HDFS directories by yyyymmdd and then use Hive to
> query the data. PutHDFS simply uses the RouteText.group attribute to create
> the year/month/day HDFS directory structure like:
> >
> > /stats/year=2015/month=11/day=28/the_stats_file_000001.csv
> >
> > However, I need to ALSO run a Hive HQL command to "alter table X add
> partition Y" to allow Hive to see this new partition of data. So, the
> "event" part of this concept would be some way to instruct PutHDFS to emit
> an event ONLY when it actually creates a new directory. There could be an
> "event" relationship that could feed some other processor, like ExecuteSQL,
> that would then add the partition ONLY when this event occurs. It would NOT
> act on any flowfiles - only on events. There will be lots of files being
> put that will fall into an ALREADY existing directory since PutHDFS only
> has to create the directory structure ONCE. I only need to know about that
> ONE event so as to run the HQL command ONCE. I know there are ways to
> "wire" this up using existing processors, like use ExecuteStreamCommand to
> run a script that checks if the directory exists, and, if not, create it
> and run the SQL processor to run a SQL command against Hive to build the
> partition and then let PutHDFS do its thing. But that means running this
> script on EVERY flow file, which is a waste of resources. Only PutHDFS
> really knows when it needs to create the directory ONCE. I was just
> wondering if there was any thought of building in some async event handling?
> >
> > Anyway, just an idea.
> >
> > Mark
>
>

Re: Concept of async event-driven processors but not the experimental scheduling policy version

Posted by Mark Payne <ma...@hotmail.com>.
Mark,

I can't say that I've ever really given thought to an explicit "Eventing Model" like the one that you are describing.
However, the way that you are describing it is really just a new relationship on the PutHDFS processor, so it would
be a very processor-specific change.

Rather than adding a new "Event" type of relationship, though, I would lean more toward creating an attribute on the existing
FlowFile that is routed to 'success'. So an attribute named, say, "hdfs.directory.created" could be added to the FlowFile
and if you care about that information, you can route the FlowFiles to a RouteOnAttribute processor, which
is able to route the FlowFile accordingly.
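A minimal sketch of that routing decision (the attribute name is the one suggested here; the function shape is hypothetical, not NiFi's actual API):

```python
def route(flowfile_attributes):
    """Mimic a RouteOnAttribute rule keyed on the suggested attribute:
    only flow files whose PutHDFS run actually created a directory go to
    a 'directory.created' relationship; everything else is unmatched."""
    if flowfile_attributes.get("hdfs.directory.created") == "true":
        return "directory.created"
    return "unmatched"
```

The 'directory.created' leg could then feed the Hive "alter table" step, so it fires only once per new partition while the bulk of the flow files pass straight through.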

Does this give you what you need?

Thanks
-Mark



> On Nov 29, 2015, at 9:36 AM, Mark Petronic <ma...@gmail.com> wrote:
> 
> I know there is an experimental event-driven scheduling policy. This is not that. Has anyone considered a pattern where processors might emit events based on certain criteria and other processors might ONLY act on those events? I'm just thinking out loud on this thought at the moment and just wanted to see if anyone else had pondered this concept. Here's my use case. Consider the RouteText processor feeding into a PutHDFS. RouteText is grouping records on yyyymmdd values using a regex because I want to partition files into HDFS directories by yyyymmdd and then use Hive to query the data. PutHDFS simply uses the RouteText.group attribute to create the year/month/day HDFS directory structure like:
> 
> /stats/year=2015/month=11/day=28/the_stats_file_000001.csv
> 
> However, I need to ALSO run a Hive HQL command to "alter table X add partition Y" to allow Hive to see this new partition of data. So, the "event" part of this concept would be some way to instruct PutHDFS to emit an event ONLY when it actually creates a new directory. There could be an "event" relationship that could feed some other processor, like ExecuteSQL, that would then add the partition ONLY when this event occurs. It would NOT act on any flowfiles - only on events. There will be lots of files being put that will fall into an ALREADY existing directory since PutHDFS only has to create the directory structure ONCE. I only need to know about that ONE event so as to run the HQL command ONCE. I know there are ways to "wire" this up using existing processors, like use ExecuteStreamCommand to run a script that checks if the directory exists, and, if not, create it and run the SQL processor to run a SQL command against Hive to build the partition and then let PutHDFS do its thing. But that means running this script on EVERY flow file, which is a waste of resources. Only PutHDFS really knows when it needs to create the directory ONCE. I was just wondering if there was any thought of building in some async event handling?
> 
> Anyway, just an idea.
> 
> Mark