Posted to dev@nifi.apache.org by Malthe <mb...@gmail.com> on 2019/07/30 14:47:10 UTC

[DISCUSS] Streaming or "lazy" mode for `CompressContent`

In reference to NIFI-6496 [1], I'd like to open a discussion on adding
compression support to flow files such that a processor such as
`CompressContent` might function in a streaming or "lazy" mode.

Context, more details and initial feedback can be found in the ticket
referenced below as well as in a related SO entry [2].

[1] https://issues.apache.org/jira/browse/NIFI-6496
[2] https://stackoverflow.com/questions/57005564/using-convertrecord-on-compressed-input

Re: [DISCUSS] Streaming or "lazy" mode for `CompressContent`

Posted by Edward Armes <ed...@gmail.com>.
Joe,

My concern is that the record reading and writing as it stands isn't as
clear as it could be, and this could make it worse. I personally found it
a little difficult to understand how some record-processing processors
worked.

That aside, however, I think that if a "flow level"/Process Group setting
for compression was added, it would potentially work as a general
solution. What I'm thinking is that as content leaves a processor it is
checked to see whether it is already compressed; if it isn't, it is
compressed on the way to the content repo, and if it is, it is left
alone. On the reverse, once content is read from the content repo it is
intercepted again and decompressed as it is loaded into the processor.
There would potentially need to be a flag for processors to indicate to
the core that their input shouldn't be decompressed.
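
To make that concrete, here is a minimal sketch of the check-and-wrap
step around the repository streams. It is only an illustration assuming
gzip as the block format; the class and method names are invented and do
not correspond to NiFi's actual content repository API.

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    // Hypothetical helper: content written to the content repository is
    // compressed unless it already looks compressed, and content read
    // back is transparently decompressed.
    final class TransparentRepoCompression {

        private static final int GZIP_MAGIC_1 = 0x1f;
        private static final int GZIP_MAGIC_2 = 0x8b;

        private TransparentRepoCompression() {
        }

        // Peek at the first two bytes to decide whether the stream
        // already holds gzip data.
        static boolean looksLikeGzip(BufferedInputStream in) throws IOException {
            in.mark(2);
            int b1 = in.read();
            int b2 = in.read();
            in.reset();
            return b1 == GZIP_MAGIC_1 && b2 == GZIP_MAGIC_2;
        }

        // Wrap the repository write stream so uncompressed content is
        // gzipped on the way in.
        static OutputStream wrapForWrite(OutputStream repoOut, boolean alreadyCompressed) throws IOException {
            return alreadyCompressed ? repoOut : new GZIPOutputStream(repoOut);
        }

        // Wrap the repository read stream so gzipped content is
        // decompressed on the way out.
        static InputStream wrapForRead(InputStream repoIn) throws IOException {
            BufferedInputStream buffered = new BufferedInputStream(repoIn);
            return looksLikeGzip(buffered) ? new GZIPInputStream(buffered) : buffered;
        }
    }

The same shape would apply to any block codec; the open question is where
the "already compressed" decision is recorded so the read side does not
have to sniff bytes on every access.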

As for handling the compression algorithms, maybe the plugin-discovery
functionality used for repo implementations could be extended and used to
silently detect compression formats and algorithms?
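
NiFi's standard processors already use Apache Commons Compress, so
(assuming a version that includes the signature-detection API) the
detection piece could look roughly like the following; the class and
method names are again invented for illustration.

    import java.io.BufferedInputStream;
    import java.io.InputStream;

    import org.apache.commons.compress.compressors.CompressorException;
    import org.apache.commons.compress.compressors.CompressorStreamFactory;

    final class CompressionDetection {

        private CompressionDetection() {
        }

        // Returns the detected compressor name (e.g. "gz", "bzip2", "xz"),
        // or null if the stream does not start with a known signature.
        // The stream must support mark/reset, hence BufferedInputStream.
        static String detectFormat(BufferedInputStream in) {
            try {
                return CompressorStreamFactory.detect(in);
            } catch (CompressorException e) {
                return null;
            }
        }

        // Wraps the stream with the matching decompressor when a signature
        // is recognised; otherwise returns the buffered stream unchanged.
        static InputStream decompressIfRecognised(InputStream in) throws CompressorException {
            BufferedInputStream buffered = new BufferedInputStream(in);
            if (detectFormat(buffered) == null) {
                return buffered;
            }
            return new CompressorStreamFactory().createCompressorInputStream(buffered);
        }
    }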

I think it would work for record and non-record data, be it text or binary.

Edward



Re: [DISCUSS] Streaming or "lazy" mode for `CompressContent`

Posted by Malthe <mb...@gmail.com>.
Joe,

I think it might not be necessary or desirable to expose this outside
of the `CompressContent` processor. Whether the processor operates in
a lazy mode (as proposed here) or the current eager mode shouldn't
change the behavior of the flow; the next processor (or processors) will
not know the difference.

The benefit of this approach is that it introduces no new concepts and
is basically just a way for people to optimize existing, already-working
flows.
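
For context, the eager path today looks roughly like the sketch below (a
simplified illustration of the ProcessSession/StreamCallback pattern, not
the actual CompressContent source; the wrapper class name is invented and
gzip input is assumed). The lazy mode proposed here would avoid
materializing the decompressed copy in the content repository while
leaving what downstream processors observe unchanged.

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.GZIPInputStream;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.io.StreamCallback;
    import org.apache.nifi.stream.io.StreamUtils;

    final class EagerDecompressExample {

        // Eager pattern: the entire decompressed result is written back to
        // the content repository before any downstream processor reads it.
        static FlowFile decompressEagerly(final ProcessSession session, final FlowFile flowFile) {
            return session.write(flowFile, new StreamCallback() {
                @Override
                public void process(final InputStream rawIn, final OutputStream rawOut) throws IOException {
                    try (InputStream in = new GZIPInputStream(new BufferedInputStream(rawIn))) {
                        StreamUtils.copy(in, rawOut); // full read, decompress, rewrite cycle
                    }
                }
            });
        }
    }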

Thanks


Re: [DISCUSS] Streaming or "lazy" mode for `CompressContent`

Posted by Joe Witt <jo...@gmail.com>.
Edward,

I like your point/comment regarding separation of concerns/cohesion. I
think we could/should consider automatically decompressing data on the fly
for processors in general in the event we know a given set of data is
compressed but is being accessed for plaintext purposes. For general block
compression types this is probably fair game and could be quite compelling,
particularly to avoid the extra read/write/content repo hits involved.

That said, for the case of record readers/writers I'm not sure we can
avoid having a specific solution. Some compression types can be
concatenated together and some cannot. Some record types would be
tolerant/still valid and some would not.
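
To illustrate the concatenation point: two independently gzipped payloads
appended byte for byte still decompress as one stream, but the
concatenated records are not necessarily a valid record set (here the CSV
header appears twice). A small sketch, assuming Apache Commons Compress
on the classpath:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

    public final class ConcatenatedGzipExample {

        private static byte[] gzip(String text) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
                gz.write(text.getBytes("UTF-8"));
            }
            return bytes.toByteArray();
        }

        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream concatenated = new ByteArrayOutputStream();
            concatenated.write(gzip("id,name\n1,alpha\n"));
            concatenated.write(gzip("id,name\n2,beta\n"));

            // decompressConcatenated=true tells commons-compress to keep
            // reading past the end of the first gzip member.
            try (GzipCompressorInputStream in = new GzipCompressorInputStream(
                    new ByteArrayInputStream(concatenated.toByteArray()), true)) {
                ByteArrayOutputStream decompressed = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    decompressed.write(buffer, 0, read);
                }
                // Prints both CSV fragments, including the duplicated header.
                System.out.println(decompressed.toString("UTF-8"));
            }
        }
    }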

Thanks
Joe


Re: [DISCUSS] Streaming or "lazy" mode for `CompressContent`

Posted by Edward Armes <ed...@gmail.com>.
So while I agree with this in principle, and it is a good idea on paper:

My concern is that this starts to add a bolt-on bloat problem. The NiFi
processors as they stand do in general follow the Unix philosophy (do one
thing, and do it well). While it might just be a case of adding a wrapper
here, it then becomes an ask to add the same wrapper to other processors
for similar functionality, and so on. That does start to cause a technical
debt problem and can also become a detrimental experience for the user.
Some of this I have mentioned in the previous thread about re-structuring
the NiFi core.

The reason I suggest doing it either at the repo level, or as the
InputStream is handed over to the processor from the core, is that it adds
a global piece of functionality which every processor that handles data
that compresses well could benefit from. Ideally it would be nice to see
it as a "per-flow" setting, but I suspect that would add more complexity
than is actually needed.
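
A rough sketch of where such a hand-over hook, plus the opt-out flag
mentioned elsewhere in this thread, could live. Every name here is
hypothetical; nothing like this exists in NiFi today.

    import java.io.InputStream;
    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    // Hypothetical opt-out marker: a processor that wants the raw,
    // still-compressed bytes declares this, and the core skips the
    // transparent decompression when handing the InputStream over.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface ReadsRawContent {
    }

    final class ContentHandOff {

        private ContentHandOff() {
        }

        // Hypothetical core hook invoked before onTrigger: decompress
        // unless the processor has opted out via the marker annotation.
        static InputStream handOver(InputStream fromRepo, Object processor) {
            if (processor.getClass().isAnnotationPresent(ReadsRawContent.class)) {
                return fromRepo;
            }
            return wrapWithDetectedDecompressor(fromRepo);
        }

        // Placeholder for the signature detection and wrapping sketched
        // earlier in the thread; returning the stream unchanged keeps this
        // example self-contained.
        private static InputStream wrapWithDetectedDecompressor(InputStream in) {
            return in;
        }
    }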

I have seen an issue where, over time, the content repo took up quite a
chunk of disk for a multi-tenanted cluster that performed lots of small
changes on lots of FlowFiles. While those hosts were under-resourced,
being able to compress the content, trading that off against the speed of
data through the flow, might have helped that situation quite a bit.

Edward


Re: [DISCUSS] Streaming or "lazy" mode for `CompressContent`

Posted by Joe Witt <jo...@gmail.com>.
Malthe

I do see value in having the Record readers/writers understand and handle
compression directly, as it will avoid the extra disk hit of the
decompress, read, compress cycle using existing processors, and further
there are cases where the compression is record-specific and not just
holistic block compression.
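
Avro data files are a concrete example of the record/block-specific case:
each block is compressed with the codec named in the file header, and the
reader decompresses transparently while iterating, so a standalone
decompress step would not even apply. A minimal sketch using the standard
Avro Java library (the wrapper class name is invented):

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    final class AvroCodecAwareRead {

        // Counts records in an Avro data file; block decompression (e.g.
        // deflate, as recorded in the file header) is handled by
        // DataFileStream itself.
        static long countRecords(InputStream flowFileContent) throws IOException {
            long count = 0;
            try (DataFileStream<GenericRecord> records =
                    new DataFileStream<>(flowFileContent, new GenericDatumReader<GenericRecord>())) {
                while (records.hasNext()) {
                    records.next();
                    count++;
                }
            }
            return count;
        }
    }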

I think Koji offered a great description of how to start thinking about
this.

Thanks
