Posted to users@nifi.apache.org by Gautier DARAS <ga...@gmail.com> on 2018/08/24 14:26:47 UTC

Merge Record Processor behavior

Hi,

We're setting up a flow that ingests small records (each consisting of one
short CSV line) from Kafka, then treats / filters / routes / merges /
compresses them before writing to HDFS.
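
Schematically, the processor chain is something like the following
(processor names approximate):

    ConsumeKafkaRecord -> [treat / filter / route] -> MergeRecord
        -> CompressContent -> PutHDFS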

We do not fully understand how the merge processor should be set up, as it
does not work the way we expect.

We want it to merge records into flowfiles that will ultimately fill our
HDFS blocks (128 MB for now).

Here are the merge processor parameters:

Minimum Number of Records: 1000000
Minimum Bin Size: 200 MB
Maximum Bin Size: 250 MB
Maximum Number of Bins: 1
In the Settings tab we left the Concurrent Tasks parameter at 1.

As you can see, we set the Maximum Bin Size higher than our target because
the flowfiles are compressed by a later processor.
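(Assuming a compression ratio of roughly 2:1, a 250 MB uncompressed bin
should come out near our 128 MB block target.)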

But we observed that although we specify a Maximum Number of Bins of 1 and
a Minimum Bin Size of 200 MB, the resulting behavior does not respect those
parameters: the processor creates two small flowfiles (around 25 MB) at a
time, even while the queue contains enough elements to fill one of 128 MB.

So our question is whether there is a way to configure our processors to
achieve our goal: filling HDFS with files of around 120 MB.

Thanks in advance,

Gautier.

Re: Merge Record Processor behavior

Posted by James Wing <jv...@gmail.com>.
Gautier,

I'm not certain exactly what is wrong, but as an experiment, please try
setting the "Maximum Number of Bins" to be greater than 1 (2 might be
enough).  My suspicion is that when you are using 100% of the allowed bins,
the processor forces the oldest bin to merge every time it needs room for a
new one, even if that bin has not yet met the minimums.  Because you allow
only 1 bin, that bin is always the oldest, which would explain the small 25
MB flowfiles.  If 2 bins were allowed, you would use 1 and keep 1 available
but never used.
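
For concreteness, the experiment is a one-property change from your current
settings (a sketch using the standard MergeRecord property names):

    Minimum Number of Records : 1000000
    Minimum Bin Size          : 200 MB
    Maximum Bin Size          : 250 MB
    Maximum Number of Bins    : 2        <- was 1

You might also set Max Bin Age as a safety valve, so that a bin which never
reaches the minimums still gets flushed eventually.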

You did not ask, but you might also consider using a sequence of two merge
processors, one to bin single records into bundles of 1,000 or 10,000,
followed by a second to achieve 1,000,000 / 128 MB.  This will help reduce
the number of simultaneous flowfiles and keep the flow rate steadier.
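
A rough sketch of how the two stages might be configured (the stage-1
thresholds below are illustrative, not something I have tested):

    MergeRecord #1 (pre-bundling)
        Minimum Number of Records : 10000
        Maximum Number of Records : 20000
        Max Bin Age               : 30 sec

    MergeRecord #2 (final sizing)
        Minimum Number of Records : 1000000
        Minimum Bin Size          : 200 MB
        Maximum Bin Size          : 250 MB
        Maximum Number of Bins    : 2

The first stage quickly collapses the flood of tiny flowfiles into a
manageable number of bundles; the second only has to pack those bundles up
to the target size.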

