Posted to users@nifi.apache.org by Benjamin Janssen <bj...@gmail.com> on 2018/08/10 20:27:02 UTC

Large JSON File Best Practice Question

All, I'm seeking some advice on best practices for dealing with FlowFiles
that contain a large volume of JSON records.

My flow works like this:

Receive a FlowFile with millions of JSON records in it.

Potentially filter out some of the records based on the value of the JSON
fields (a custom processor uses a regex and a JSONPath expression to produce a
"matched" and a "not matched" output path).

Potentially split the FlowFile into multiple FlowFiles based on the value
of one of the JSON fields (a custom processor uses a JSONPath expression and
groups records into output FlowFiles based on the value).

Potentially split the FlowFile into uniformly sized smaller chunks so the
file size doesn't choke downstream systems (we use SplitText when the data is
newline-delimited; we don't currently have a way to do this when the data is
a JSON array of records).

Strip out some of the JSON fields (using JoltTransformJSON).

At the end, wrap each JSON record in a proprietary format (a custom processor
does the wrapping).

This flow is roughly similar across several different unrelated data sets.
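
A minimal sketch of the filtering step over newline-delimited input (assuming
the Jayway JsonPath library on the classpath; the JSONPath and regex below are
placeholders, not the real ones):

import com.jayway.jsonpath.JsonPath;
import com.jayway.jsonpath.PathNotFoundException;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.util.regex.Pattern;

public class NdjsonFilter {

    // Placeholders for illustration only; the real JSONPath and regex would
    // come from processor properties.
    private static final String JSON_PATH = "$.eventType";
    private static final Pattern PATTERN = Pattern.compile("alert|warning");

    // Reads one record per line, so only a single record is in memory at a time.
    public static void filter(BufferedReader in, BufferedWriter matched,
                              BufferedWriter unmatched) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue;
            }
            boolean isMatch;
            try {
                Object value = JsonPath.read(line, JSON_PATH);
                isMatch = value != null && PATTERN.matcher(String.valueOf(value)).find();
            } catch (PathNotFoundException e) {
                isMatch = false;  // record lacks the field: treat as not matched
            }
            BufferedWriter target = isMatch ? matched : unmatched;
            target.write(line);
            target.newLine();
        }
    }
}

The same line-at-a-time pattern extends to the split-by-field-value step by
keeping a map of output writers keyed by the observed value.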

The input data files are sometimes provided as a single JSON array and
sometimes as newline-delimited JSON records.  In general, we've found
newline-delimited JSON records far easier to work with because we can
process them one at a time without loading the entire FlowFile into memory
(which we have to do for the array variant).

However, if we are to use JoltTransformJSON to strip out or modify some of
the JSON contents, it appears to operate only on an array (which is
problematic from a memory-footprint standpoint).

We don't really want to break our FlowFiles up into individual JSON records,
as the number of FlowFiles the system would have to handle would be orders
of magnitude larger than it is now.

Is our approach of moving towards newline-delimited JSON a good one?  If
so, is there anything that would be recommended for replacing
JoltTransformJSON?  Or should we build a custom processor?  Or is this a
reasonable feature request for the JoltTransformJSON processor to support
newline-delimited JSON?

Or should we be looking into ways to do lazy loading of the JSON arrays in
our custom processors (I have no clue how easy or hard this would be to
do)?  My little bit of googling suggests this would be difficult.
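
For reference, incremental (lazy) reading of a top-level JSON array is
reasonably straightforward with Jackson's streaming API, and it gives one way
to chunk or transform array input without holding the whole content in memory.
A rough sketch that walks an array element by element and re-emits
newline-delimited JSON; the field being stripped is a made-up placeholder:

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamingArrayToNdjson {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Walks a top-level JSON array one element at a time and re-emits each
    // element as a line of newline-delimited JSON, so only one record is ever
    // held in memory.  Per-record transforms (e.g. dropping fields, as Jolt
    // would) can be applied inside the loop.
    public static void convert(InputStream in, OutputStream out) throws IOException {
        try (JsonParser parser = MAPPER.getFactory().createParser(in)) {
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IOException("Expected a top-level JSON array");
            }
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                JsonNode record = MAPPER.readTree(parser);  // reads just this object
                if (record instanceof ObjectNode) {
                    // "internalDebugField" is a placeholder for a field to strip.
                    ((ObjectNode) record).remove("internalDebugField");
                }
                out.write(MAPPER.writeValueAsBytes(record));
                out.write('\n');
            }
        }
    }
}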

Re: Large JSON File Best Practice Question

Posted by Joe Witt <jo...@gmail.com>.
Ben,

I'm not sure you could reliably convert the format of the data and
retain schema information unless both formats allow for explicit
schema retention (as Avro does, for instance).  JSON doesn't
really offer that.  So when you say you want to convert even the
unknown fields, but there is no explicit type or schema
information to follow, I'm not sure what a non-destructive, lossless
conversion would look like.

You might still want to give the existing readers/writers a try, to
experiment and find the line of how far you can go.  You could also
script or write your own reader, one which extracts a schema sufficient
for your purposes and places it on the FlowFile.
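
As one rough illustration of that idea: a best-effort schema could be derived
by peeking at the first record and mapping field names to JSON node types,
then serializing the result onto the FlowFile as an attribute.  The class
below is a sketch with invented names, not an existing NiFi component:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class BestEffortSchema {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Derives a simple field-name -> JSON-node-type map from a single record.
    // The serialized result could be placed on the FlowFile as an attribute so
    // downstream processors at least know which fields were seen.
    public static Map<String, String> inferFromRecord(String recordJson) throws IOException {
        JsonNode record = MAPPER.readTree(recordJson);
        Map<String, String> schema = new LinkedHashMap<>();
        Iterator<Map.Entry<String, JsonNode>> fields = record.fields();
        while (fields.hasNext()) {
            Map.Entry<String, JsonNode> field = fields.next();
            schema.put(field.getKey(), field.getValue().getNodeType().name().toLowerCase());
        }
        return schema;
    }
}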

Thanks

Re: Large JSON File Best Practice Question

Posted by Benjamin Janssen <bj...@gmail.com>.
I am not.  I continued googling for a bit after sending my email and
stumbled upon a slide deck by Bryan Bende.  My initial concern from
looking at it is that it seems to require schema knowledge.

For most of our data sets, we operate in a space where we have a handful of
guaranteed fields and who knows what other fields the upstream provider is
going to send us.  We want to operate on the data in a manner that is
non-destructive to unanticipated fields.  Is that a blocker for using the
RecordReader stuff?


Re: Large JSON File Best Practice Question

Posted by Joe Witt <jo...@gmail.com>.
ben

are you familiar with the record readers, writers, and associated
processors?

i suspect if you make a record writer for your custom format at the end of
the flow chain, you'll get great performance and control.
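
As a rough illustration (the real proprietary format isn't shown in this
thread, so the envelope below is invented): the per-record wrapping can be
written against plain record maps, which is essentially what a custom record
writer would do, regardless of whether the upstream reader parsed a JSON
array, newline-delimited JSON, or Avro.

import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.BufferedWriter;
import java.io.IOException;
import java.util.Map;

public class ProprietaryWrapperWriter {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    private final BufferedWriter out;
    private long recordCount;

    public ProprietaryWrapperWriter(BufferedWriter out) {
        this.out = out;
    }

    // Wraps one record in a made-up header/footer envelope and writes it out.
    // In NiFi, this per-record logic is roughly what a custom record writer
    // would perform, independent of which reader produced the records.
    public void write(Map<String, Object> record) throws IOException {
        String payload = MAPPER.writeValueAsString(record);
        out.write("BEGIN-RECORD " + (++recordCount) + " len=" + payload.length());
        out.newLine();
        out.write(payload);
        out.newLine();
        out.write("END-RECORD");
        out.newLine();
    }
}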

thanks
