Posted to issues@nifi.apache.org by GitBox <gi...@apache.org> on 2019/04/09 15:12:49 UTC

[GitHub] [nifi] arenger commented on issue #3414: NIFI-5900 Added SplitLargeJson processor

arenger commented on issue #3414: NIFI-5900 Added SplitLargeJson processor
URL: https://github.com/apache/nifi/pull/3414#issuecomment-481294396

@ottobackwards I originally sought to improve `SplitJson` instead of adding a new processor. I could certainly submit a different PR targeting an improvement to `SplitJson`, but there were a few reasons I thought a different processor might be better:

1. The `SplitLargeJson` processor is designed to always output complete JSON documents, which differs from the `SplitJson` behavior. For example, when splitting an array of strings, `SplitJson` outputs `String1`, `String2`, etc., whereas `SplitLargeJson` outputs `["String1"]`, `["String2"]`, etc. This is an advantage when the output relationship (the split relationship) is routed to another processor that expects JSON.
2. The `SplitJson` processor can only split arrays: the JSON Path must target an array in the document. `SplitLargeJson`, however, can split arrays _and_ objects. If the JSON Path points to an object, it outputs each key-value pair of that object in a separate flowfile.
3. The `SplitJson` processor sets a `fragment.count` attribute on outgoing flowfiles to indicate the total number of documents that were split from the designated JSON Path. This is inherently impossible with a SAX-like (streaming) approach to reading the JSON, because the processor is designed to avoid loading the whole document into memory at once. Therefore, to preserve the current behavior, a setting would need to be added to optionally enable the optimized handling for large files, with a stated caveat that the `fragment.count` attribute would be unavailable.
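
To make the difference in points 1 and 2 concrete, here is a rough sketch in Python (for brevity; it is not the processor code). Both function names are hypothetical, a plain top-level key stands in for the JSON Path, and the `{"key": value}` shape used for object splits is an assumption, since only the array case is spelled out above.

```python
import json


def split_like_splitjson(doc, key):
    """Approximate the SplitJson behavior described above (hypothetical helper).

    Only arrays can be split, and string elements come out as bare text,
    which is not valid JSON on its own.
    """
    target = doc[key]
    if not isinstance(target, list):
        raise ValueError("SplitJson-style split requires an array")
    # Assumption: non-string elements are serialized as JSON text.
    return [el if isinstance(el, str) else json.dumps(el) for el in target]


def split_like_splitlargejson(doc, key):
    """Approximate the SplitLargeJson behavior described above (hypothetical helper).

    Every fragment is wrapped so that it is a complete JSON document.
    """
    target = doc[key]
    if isinstance(target, list):
        # Each element is re-wrapped in a one-element array.
        return [json.dumps([el]) for el in target]
    if isinstance(target, dict):
        # Assumption: each key-value pair becomes a one-entry object.
        return [json.dumps({k: v}) for k, v in target.items()]
    raise ValueError("target must be an array or an object")


doc = {"names": ["String1", "String2"], "settings": {"a": 1, "b": 2}}
print(split_like_splitjson(doc, "names"))
print(split_like_splitlargejson(doc, "names"))
print(split_like_splitlargejson(doc, "settings"))
```

In this sketch, every string returned by `split_like_splitlargejson` can be fed straight back into `json.loads`, while the bare strings from `split_like_splitjson` cannot.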

Again, I could submit a different pull request that targets an optimization of `SplitJson` rather than the addition of a new `SplitLargeJson` processor. I started down that path originally, with a boolean setting to optionally activate large-file processing (in that mode it could also split objects, provided the JSON Path was not "overly complex", i.e. did not require backtracking, etc.). But then I had to change the processor to occasionally output non-JSON documents, which made the code less elegant. That said, I can see the value in sticking with one processor.
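
The `fragment.count` caveat from point 3 can be illustrated with a small streaming sketch, again in Python and not the actual processor code. It assumes the input is a top-level JSON array. `iter_array_elements` and `split_streaming` are hypothetical names, and the use of `fragment.index` on the streamed fragments is an assumption. Elements are decoded as soon as enough text has been read, so the total is only known once the input is exhausted.

```python
import io
import json


def iter_array_elements(stream, chunk_size=64):
    """Yield the elements of a top-level JSON array one at a time.

    Text is read in small chunks and only the current, not yet complete
    element is buffered, so the whole document is never held in memory.
    This sketch is lenient about stray commas and re-parses a partial
    element on every chunk, which a real implementation would avoid.
    """
    decoder = json.JSONDecoder()
    buf = ""
    opened = False
    while True:
        buf = buf.lstrip(" \t\r\n,") if opened else buf.lstrip()
        if not opened and buf.startswith("["):
            opened, buf = True, buf[1:]
            continue
        if opened and buf.startswith("]"):
            return
        if opened:
            try:
                element, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                element, end = None, -1
            rest = buf[end:].lstrip() if end >= 0 else ""
            # Accept the element only once its terminating "," or "]" is
            # visible; otherwise "12" might really be the start of "12.5".
            if rest[:1] in (",", "]"):
                yield element
                buf = rest
                continue
        elif buf:
            raise ValueError("expected a top-level JSON array")
        chunk = stream.read(chunk_size)
        if not chunk:
            raise ValueError("truncated or malformed JSON array")
        buf += chunk


def split_streaming(stream, chunk_size=64):
    """Yield (attributes, body) pairs; no fragment.count can be attached."""
    for index, element in enumerate(iter_array_elements(stream, chunk_size)):
        # The total number of fragments is unknown at this point.
        yield {"fragment.index": index}, json.dumps([element])


source = io.StringIO('[ "String1", 12.5, {"k": [1, 2]}, true ]')
for attrs, body in split_streaming(source, chunk_size=3):
    print(attrs, body)
```

A non-streaming mode could compute the count by collecting all fragments first, which is exactly the memory cost the large-file mode is meant to avoid.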

As for JsonSurfer, I had honestly never heard of it. My code here came from a work project I did a couple of years ago that was finally approved for release to the public. I could probably make a change to `SplitJson` that employs JsonSurfer... I'm bummed my code isn't as novel as I'd hoped, but I know that's how things go!

Let me know what you think is best.
