Posted to users@nifi.apache.org by Mark Petronic <ma...@gmail.com> on 2015/11/07 22:57:22 UTC

Using InvokeHTTP to send GET but without content

I thought this seemed like a simple plan... I wanted to send an audit
message to a REST server every time I process every file in my flow. The
sub flow in question is:

+---------------+   +--------------+           +---------+
| UnpackContent +-->+ MergeContent +--merged+->+ PutFile |
+---------------+   +------+-------+           +---------+
                           |
                           v original
                   +-------+-------+    +----------+
                   |UpdateAttribute+--->+InvokeHTTP|
                   +---------------+    +----------+

I want to record every zip file unpacked, so I feed (original) into
UpdateAttribute where I create a message like "Processed: ${ZipPath}",
where ${ZipPath} was added as an attribute earlier in the flow when I pull
in the zip files to process. I also want to send another message from
PutFile once the file is delivered, but my drawing does not show that part
yet. Same concept: hit a REST API to record the event. I also want to do
this sort of 'general' audit logging for other parts of the flow. For
example, I have a bunch of conditions/rules/actions to conditionally control
various processing. They are set up for the files and content types
that I am aware of today. So, the [match] flow from some upstream
UpdateAttribute will carry the files that match the definitions. But, if
some unexpected file gets ingested, it might NOT match any rule. So, I
want to route the [unmatched] files into this audit logging, where I can see
them and decide how to handle the new file type. I know there is
provenance, but I want long-term audits, I don't need to retain the flow
file content, and, frankly, plain audit logs are just easier to work with
(grep/awk/etc), IMO, and for my needs at this time.
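The kind of call intended here is easy to sketch outside NiFi. A minimal Python example of building a GET URL from attribute name/value pairs (the audit endpoint and attribute values are hypothetical; in the flow, ${ZipPath} would come from the FlowFile's attributes):

```python
from urllib.parse import urlencode

# Hypothetical audit endpoint and attribute values; in the real flow,
# ZipPath would be read from the FlowFile rather than a dict.
attributes = {"event": "unpacked", "ZipPath": "/data/incoming/batch42.zip"}
url = "http://audit.example.com/api/v1/events?" + urlencode(attributes)
# urlencode percent-escapes the values, so paths stay safe in the URL.
# urllib.request.urlopen(url) would then fire the GET; no body is needed.
print(url)
# -> http://audit.example.com/api/v1/events?event=unpacked&ZipPath=%2Fdata%2Fincoming%2Fbatch42.zip
```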

Problem is, InvokeHTTP never invokes. Why not? I think it would be nice if
InvokeHTTP would hit the URL without a flow file since, in my case, I never
want to send any body content, but rather use attributes to
build up a URL with param=value pairs to send. I am probably off base here.
Any ideas? I saw the exchange below, but using GenerateFlowFile just to
"trick" InvokeHTTP into doing something seems kludgy.

https://mail-archives.apache.org/mod_mbox/nifi-users/201509.mbox/%3CCALJK9a4U29Ua9zVT9oPCYHMg=k4Vneao30EMszRe3yPkz6xxJQ@mail.gmail.com%3E

I am trying to adapt to the "NiFi" paradigm but feel like I am swimming
upstream at times.

Thanks

Re: Using InvokeHTTP to send GET but without content

Posted by Mark Petronic <ma...@gmail.com>.
Joe,

Thanks for the input. I ended up using ExecuteStreamCommand and wget
in place of InvokeHTTP for now. That works well, and since the
throughput is not intense, spawning off wget for each REST call
is fine for now. I had looked at the reporting tasks that are
available. Extending in that way makes sense, and I will look into it
as I progress towards a more production-quality deployment. I still
need to wrap my head around how I will do monitoring, like monitoring
the counts of various events so that I can detect when certain flows
might be having issues. For example, I know I should be seeing files
ingested from 20 unique sources and, for each source, I expect to see
approximately N files to process per hour. I want to somehow alarm on
such criteria because, although NiFi might not be the reason for loss
of flow, often there are upstream stat sources that go bad, and I will
be able to detect that using these metrics. But that's down the road
a tad. Also, once the updated InvokeHTTP functionality is pushed, I
will convert to it, test accordingly, and give you guys feedback.
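The alarming idea above (expect roughly N files per source per hour, flag sources that fall short) can be sketched in a few lines; all source names and thresholds here are illustrative:

```python
from collections import Counter

# Hypothetical expectations: files per hour for each known source.
expected_per_hour = {"source_a": 100, "source_b": 40}

def find_quiet_sources(events, expected, tolerance=0.5):
    """Return sources whose hourly count fell below tolerance * expected.

    `events` is an iterable of source names observed in the last hour.
    """
    counts = Counter(events)
    return sorted(
        source for source, expected_n in expected.items()
        if counts[source] < expected_n * tolerance
    )

# source_b went nearly quiet this hour, so it gets flagged:
observed = ["source_a"] * 95 + ["source_b"] * 5
print(find_quiet_sources(observed, expected_per_hour))  # -> ['source_b']
```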

As for a 'general' logging framework, here's one quick thought to add
to the mix of ideas. We all use logging extensively when we write
code, both for tracing/debugging and event recording. NiFi is more of
a 'Lego' style of coding, so to speak. But I still find myself
wanting to produce logging events for both debugging and understanding
my flows (as part of the typical development phase) as well as for
monitoring, and I find there is not really any 'built-in' support for
that. Processors are like functions or classes or modules in code. We
typically have logging frameworks that are function/class/module aware
and work with low friction in those contexts. So, in that light, I
favor having some sort of general, built-in logging in each
processor. It seems like it could be generic enough to be part of the
abstract classes so that it is easily inherited by processors.
Functionally, I would want to be able to define a log event
consisting of an expression-language-defined string using any
attributes in the context of the processor generating the log event. I
would also see routing conditions playing a part so that, for
example, I could say "log this string message for every 'unmatched'
condition but this other message for every 'matched' condition". There
should be some basic built-in parts to the message, like a timestamp,
processor name/id, etc., that are included by the framework
automatically (or selectable (enabled/disabled) via checkboxes or
something like that). I envision this very much like what you use in
log4j/python logging, etc. - except not at the code level but at the
processor level. I would even see logging levels being settable
globally or locally (i.e., overridden at the processor level). We would
have to consider the log sinks as well. It could just leverage
logback/log4j for emitting, so that I could use those frameworks
to define the log files, rotation, or even sending over a socket in
syslog style. So, no reinventing of that stuff.
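The proposal maps fairly directly onto what log4j/python logging already provide. A rough sketch of the idea using Python's logging module, with the processor name carried as framework-supplied context and a per-processor level override (all processor and file names here are illustrative):

```python
import logging

# Records carry the processor name automatically via a LoggerAdapter,
# modeling the "timestamp + processor id included by the framework" idea.
logging.basicConfig(format="%(asctime)s %(processor)s [%(levelname)s] %(message)s")

def processor_logger(processor_name, level=logging.INFO):
    # Local level override, analogous to overriding a global level per processor.
    logger = logging.getLogger(processor_name)
    logger.setLevel(level)
    return logging.LoggerAdapter(logger, {"processor": processor_name})

log = processor_logger("UpdateAttribute-audit", logging.DEBUG)

# A routing condition choosing which message template fires:
matched = False
if matched:
    log.info("Processed: %s", "/data/incoming/batch42.zip")
else:
    log.warning("Unmatched file type for %s", "/data/incoming/batch42.zip")
```

Sinks, rotation, and syslog forwarding would then come for free from the underlying framework's handler configuration rather than being reinvented.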

Anyway, that's just how I see it with my vast 2-3 weeks working with
NiFi. LOL. Maybe I am looking at this all wrong, but it's my 'new user'
viewpoint. It's a great app for sure, and I am now committed to it
being the cornerstone of my ingest/ETL into Hadoop. Not sure what I
would have been doing had I not stumbled upon NiFi, at a Meetup
for Spark GraphX no less. :) You guys have a winner here,
and what really impresses me more is all of the very helpful,
responsive support I see in the community for NiFi. That clearly
speaks to the momentum and passion of the users/devs and bodes well for
the product's future.

Mark


On Sat, Nov 7, 2015 at 7:02 PM, Joe Witt <jo...@gmail.com> wrote:
> [quoted text trimmed]
Re: Using InvokeHTTP to send GET but without content

Posted by Joe Witt <jo...@gmail.com>.
Mark,

What you're trying to do with regard to auditing is exactly what the
ReportingTask extension is designed to support.  It has access to the
provenance events, which include the flow file attributes for each
event.  If you built a ReportingTask which pulls whatever
attributes you want and formats requests however you want, you could
report this data to a RESTful endpoint in a rather straightforward
manner.  That approach requires you to build your own custom
ReportingTask, though.  As a reminder, though probably not what you want
here, you could also have the auditing service pull the information
from NiFi using NiFi's REST API.

Regarding access to grep/awk/etc. for audit-log/provenance-type data:
there are others who share your viewpoint, and I don't think
anyone is opposed to providing support for this. This again would be a
good thing to build as a reporting task, where it would basically
serialize events of interest to some log file in some deterministic
format.  In fact, perhaps we should just build in a standard one, in
syslog format or something similar.  If folks have ideas on a
good format to use, please advise.  Note that this approach simply means
you'll end up with reliably formatted log files.  You'd still need to
get them over to your REST endpoint, but of course there are ways to
slice that.
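A deterministic, grep-friendly line format for such a reporting task might look like the following; the field layout and event dict are purely illustrative, not an existing NiFi format:

```python
from datetime import datetime, timezone

def format_audit_line(event):
    """Serialize a provenance-style event dict into one grep-friendly line."""
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc).isoformat()
    # Sort attributes so the same event always serializes identically.
    attrs = ",".join(f"{k}={v}" for k, v in sorted(event["attributes"].items()))
    return f"{ts} {event['processor']} {event['type']} {attrs}"

line = format_audit_line({
    "timestamp": 1446937042,  # seconds since the epoch
    "processor": "UnpackContent",
    "type": "FORK",
    "attributes": {"ZipPath": "/data/in/a.zip", "filename": "a.zip"},
})
print(line)
# -> 2015-11-07T22:57:22+00:00 UnpackContent FORK ZipPath=/data/in/a.zip,filename=a.zip
```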

If you'd like a zero-coding option, then you should be able to approach
it as you are.  InvokeHTTP has a known issue where it requires flow
files to be present in order to fire.  That is being addressed (potentially
as soon as Monday) [1].  In your case, though, you will have FlowFiles
available, so why it isn't firing needs to be looked into
further.  Your use case, where you don't want to send the content of
the FlowFile through, makes sense.  I was surprised to see it wasn't
already an option.  That too would be a good JIRA, as simply
using the attributes themselves to craft a request would often be
sufficient.  I'm actually surprised we've not already run into that.

[1] https://issues.apache.org/jira/browse/NIFI-1009

Thanks
Joe

On Sat, Nov 7, 2015 at 4:57 PM, Mark Petronic <ma...@gmail.com> wrote:
> [quoted text trimmed]