Posted to users@nifi.apache.org by Mark Petronic <ma...@gmail.com> on 2015/11/06 06:05:33 UTC

MergeContent - Question on interplay of max bins, duration, and size with numerous correlation variable values

I was expecting that if I set the min bin size to 128 MB, the max to 512 MB,
the bin duration to 60 s, and max bins to 100, and if data was flowing
quickly enough that I received more than 512 MB in 60 seconds (all flow
files are keyed to the same correlation variable in this case), I would see
output flow files of around the 512 MB max. But that is not what I see. I
played around with changing the max bins and duration but still can't seem
to "force" large files. Instead I see files of around 100-150 MB.

Can someone point me to a more detailed description of how the binning
logic works? I would like to understand the interplay between the number of
bins, duration, and size when sets of flow files come in that are linked to
different correlation values. In my case, if I process all my file types, I
have about 19 different classes of data, so there are 19 different values
for the correlation variable I use, "StatClass". Why would one want many or
few max bins? Does a larger duration put more memory pressure on the JVM,
or are the bins accumulated as files on disk rather than in memory? I am
trying to produce large files for HDFS storage from a stream of many
smaller files.

Thanks

Re: MergeContent - Question on interplay of max bins, duration, and size with numerous correlation variable values

Posted by Andre <an...@fucs.org>.
Mark,

I asked this question a few weeks ago.

Here's the thread (Subject: Interaction between MergeContent parameters):

http://mail-archives.apache.org/mod_mbox/nifi-users/201510.mbox/%3CCACkT4wYfUS_9WFuK8YUy88AKKGTTrgvpNOFhxok_QUA_J%3DCtGg%40mail.gmail.com%3E

Cheers

