Posted to dev@nifi.apache.org by Andre <an...@fucs.org> on 2015/12/08 09:07:28 UTC

At what stage data becomes NiFi's "problem"

All,

Still working on the lumberjack processor. Data is currently being decoded,
SSL is sort of working but before I start wrapping up I wanted to confirm:

Lumberjack is a protocol that includes the dispatch of an acknowledgement
message to the producing agent.

As a consequence, a producer tailing a file will usually only update its
offset AFTER receiving the acknowledgement from the lumberjack endpoint.

Ideally this acknowledgement should only be sent after data is no longer in
the processor's memory buffers and the chances of data loss are restricted
to catastrophic failure.

Which leads to my question: From a development point of view, at what stage
is data assumed to be under NiFi's care?

I thank you in advance.

Re: At what stage data becomes NiFi's "problem"

Posted by Aldrin Piri <al...@gmail.com>.
I think Joe's perspective maps more closely to what Andre was looking for:
knowing when a consumer can be notified (and guaranteed) that data has been
handed off successfully within the overall flow. The key factor is that
this mechanism provides at-least-once delivery: the unit of work that
accepts the data commits before the acknowledgement completes the round
trip, so any speed bump after that commit can keep the acknowledgement from
making it back to your producer.  This ties into Oleg's point about
catastrophic failure, since unlucky timing can cause data duplication, as
highlighted in the developer guide. Regardless, this data is captured in
the content repository and enjoys the same copy-on-write/pass-by-reference
semantics that underpin a lot of NiFi's performance.
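
To make the at-least-once point concrete, here is a minimal sketch in Python.
It is a toy model, not NiFi or lumberjack code: run_producer, FlakyReceiver
and the window size are all hypothetical names chosen for illustration. The
receiver commits each batch before acknowledging, and the acknowledgement for
the first batch is deliberately lost, so the producer resends it.

```python
def run_producer(lines, receiver, window=2):
    """Send lines in batches; advance the offset only after an ack."""
    offset = 0
    while offset < len(lines):
        batch = lines[offset:offset + window]
        acked = receiver(batch)
        if acked:
            offset += len(batch)
        # No ack: the offset is unchanged, so the same batch is resent.


class FlakyReceiver:
    """Hypothetical receiver that commits first and may then lose the ack."""

    def __init__(self, drop_ack_on_calls):
        self.committed = []          # stands in for durable storage
        self.calls = 0
        self._drop = set(drop_ack_on_calls)

    def __call__(self, batch):
        self.calls += 1
        self.committed.extend(batch)          # commit happens first
        return self.calls not in self._drop   # the ack may be lost afterwards


receiver = FlakyReceiver(drop_ack_on_calls={1})
run_producer(["a", "b", "c"], receiver)
print(receiver.committed)  # ['a', 'b', 'a', 'b', 'c']
```

Nothing is lost, but the first batch is stored twice, which is the
duplication case described above.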

Oleg's first point picks up at the juncture where data has moved beyond the
initial consumption you outlined above; it details that process and ties
into the content repository's key features.  While that data is streamed in
by the consumer and enters the purview of NiFi, ownership does not occur
until the aforementioned commit.  If exactly-once semantics are important
for a particular application, there are ways of greatly aiding that
process, such as using DetectDuplicate driven by a background cache.
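
As a rough illustration of cache-backed duplicate filtering, here is a small
Python sketch. It does not describe how DetectDuplicate is implemented; the
function name, the in-memory set standing in for the cache, and the choice of
a content hash as the key are all assumptions made for this example.

```python
import hashlib


def drop_duplicates(records, seen=None):
    """Return records whose key has not been seen before.

    `seen` plays the role of the cache; pass the same set across calls
    to remember keys between batches.
    """
    seen = set() if seen is None else seen
    unique = []
    for record in records:
        # Assumption: the record content itself identifies a duplicate.
        key = hashlib.sha256(record.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique


print(drop_duplicates(["a", "b", "a", "b", "c"]))  # ['a', 'b', 'c']
```

Note that keying on content alone would also collapse log lines that are
legitimately identical, so a real flow would likely need a key that includes
something like the source and position of each record.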

After that commit, that particular file could have many different paths and
ways in which it is processed with varying outcomes.

Awesome to hear you are continuing work on extending the capabilities and
we look forward to aiding further in your contribution.  Excellent question
to be mindful of in the course of being a responsible producer in ensuring
the data delivery.


Re: At what stage data becomes NiFi's "problem"

Posted by Oleg Zhurakousky <oz...@hortonworks.com>.
At a high level, we try not to copy anything unless we have to, so when you say “under NiFi care” it becomes a bit unclear. For example, one may be copying a file using a zero-copy algorithm. Let’s assume that NiFi was the facilitator of that process. In that case, the data is/was never under NiFi management because nothing was read into memory to perform the copy. Now, even if something is read into memory, what does it really mean from your perspective? Technically one may argue that the ‘record’ is now under NiFi management and could be acknowledged. But what if the processing of this record fails somewhere downstream?

Basically, IMHO your question is about transactional capabilities, where a transaction implies that the acknowledgment is provided *only* when a record is fully processed and will never be re-processed, with the exception of catastrophic failures.
If so, given the asynchronous nature of NiFi, it may not be a straightforward process, albeit doable.

But before we get to that, let us know if my ramblings above are not totally off ;).

Cheers
Oleg



Re: At what stage data becomes NiFi's "problem"

Posted by Joe Witt <jo...@gmail.com>.
Andre,

Excellent.  Glad to hear you're making progress.  Data is said to be
under NiFi's control once a FlowFile with its content has been
generated in a ProcessSession and that session has been committed.

A more detailed overview of this part of the process is here:
https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#ingress
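
The ordering described above (commit first, acknowledge second) can be
sketched as follows. This is a Python toy model only: ToySession,
receive_frame and send_ack are hypothetical names and are not the NiFi
ProcessSession API; a plain list stands in for the repository.

```python
class ToySession:
    """Hypothetical stand-in for a processor session; not the NiFi API."""

    def __init__(self, repository):
        self._repository = repository  # stands in for durable storage
        self._pending = []             # data still held only in memory

    def create(self, payload):
        self._pending.append(payload)

    def commit(self):
        # After this point the data is considered owned by the receiver.
        self._repository.extend(self._pending)
        self._pending.clear()


def receive_frame(payloads, repository, send_ack):
    """Accept one window of payloads and ack only after the commit."""
    session = ToySession(repository)
    for payload in payloads:
        session.create(payload)
    session.commit()          # 1. hand the data over
    send_ack(len(payloads))   # 2. only then acknowledge to the producer


repository = []
acks = []
receive_frame([b"line 1", b"line 2"], repository, acks.append)
```

If the commit raised an exception, send_ack would never be called, so the
producer would keep its offset and resend the same window later.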

Thanks
Joe
