Posted to issues@nifi.apache.org by GitBox <gi...@apache.org> on 2019/04/25 22:37:16 UTC

[GitHub] [nifi] arenger opened a new pull request #3455: NIFI-5900 Add SelectJson processor

arenger opened a new pull request #3455: NIFI-5900 Add SelectJson processor
URL: https://github.com/apache/nifi/pull/3455
 
 
   
   ### Overview
   
   The goal of this PR is to further fortify NiFi when working with large JSON files.  As noted in the [NiFi overview](https://nifi.apache.org/docs/nifi-docs/html/overview.html), systems will invariably receive "data that is too big, too small, too fast, too slow, corrupt, wrong, or in the wrong format."  In the case of "too big", NiFi (or any JVM) can continue just fine and handle large files with ease if it does so in a streaming fashion, but the current JSON processors use a DOM approach that is limited by available heap space.  This PR recommends the addition of a `SelectJson` processor that can be employed when large JSON files are expected or possible.
   
   The current `EvaluateJsonPath` and `SplitJson` processors both leverage the [Jayway JsonPath](https://github.com/json-path/JsonPath) library.  The Jayway implementation has excellent support for JSON Path expressions, but requires that the entire JSON file be loaded into memory.  It builds a document object model (DOM) before evaluating the targeted JSON Path.  This is already noted as a "System Resource Consideration" in the [documentation](https://github.com/apache/nifi/blob/rel/nifi-1.9.1/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitJson.java#L85) for the `SplitJson` processor, and the same is true for `EvaluateJsonPath`.
   
   The proposed `SelectJson` processor uses an alternate library called [JsonSurfer](https://github.com/jsurfer/JsonSurfer) to evaluate a JSON Path without loading the whole document into memory all at once, similar to SAX implementations for XML processing. This allows for near-constant memory usage, independent of file size, as shown in the following test results:
   
   ![SelectJsonMemory](https://user-images.githubusercontent.com/1693576/56772330-a059db00-6787-11e9-9cd2-08d201bfb7ab.png)
   
   The trade-off is between heap space usage and JSON Path functionality.  The `SelectJson` processor supports almost all of JSON Path, with a few limitations mentioned in the `@CapabilityDescription`.  For full JSON Path support and/or multiple JSON Path expressions, the `EvaluateJsonPath` and/or `SplitJson` processors should be used.  When memory conservation is important, the `SelectJson` processor should be used.
   
   ### Licensing
   
   The [JsonSurfer](https://github.com/jsurfer/JsonSurfer) library is covered by the MIT License which is [compatible with Apache 2.0](https://www.apache.org/legal/resolved.html#category-a).  
   
   ### Testing
   
   This PR is a follow-on from #3414, in which I proposed a similar solution that required extensive unit testing.  Tests from that PR were adapted and preserved for this PR, even though many of them are testing the `JsonSurfer` library.  This is a much simpler PR since the path processing is handled in a third-party library.
   
   As for the memory statistics noted above, they were gathered using the same methodology described in #3414.  For posterity, here's a python script to generate JSON files of arbitrary size:
   
   ```
   import uuid
   
   # I outer arrays, each holding J rows of K UUID strings; raise J to grow the file.
   (I, J, K) = (1, 8737, 3)
   with open('out.json', 'w') as f:
       f.write("[")
       for i in range(I):
           f.write("[")
           for j in range(J):
               # Each row is a JSON array of K quoted UUIDs.
               f.write("[" + ",".join('"' + str(uuid.uuid4()) + '"' for _ in range(K)) + "]")
               f.write(",\n" if j < J - 1 else "\n")
           f.write("],\n" if i < I - 1 else "]\n")
       f.write("]\n")
   ```
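
To pick `(I, J, K)` for a target file size without generating the file first, the output length of the script above can be computed directly. The following is a minimal sketch, not part of the PR: `estimate_size` and `UUID_CHARS` are hypothetical names, and it assumes all three values are at least 1 and that every character, including the newline, is written as a single byte.

```python
UUID_CHARS = 36  # length of str(uuid.uuid4()): 32 hex digits plus 4 hyphens


def estimate_size(i_count, j_count, k_count):
    """Return how many characters the generator script writes for (I, J, K)."""
    if min(i_count, j_count, k_count) < 1:
        raise ValueError("I, J and K must all be at least 1")
    quoted = UUID_CHARS + 2                              # UUID plus two double quotes
    row = 1 + k_count * quoted + (k_count - 1) + 1       # "[" + items + commas + "]"
    rows = j_count * row + 2 * (j_count - 1) + 1         # ",\n" between rows, "\n" after the last
    block = 1 + rows + 1                                 # "[" + rows + "]"
    blocks = i_count * block + 2 * (i_count - 1) + 1     # ",\n" between blocks, "\n" after the last
    return 1 + blocks + 2                                # outer "[" and the final "]\n"


print(estimate_size(1, 8737, 3))  # 1048445
```

Under those assumptions, the values `(1, 8737, 3)` used in the script come to 1,048,445 bytes, which is just under 1 MiB.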
   
   ### How to use SelectJson Processor
   
   Given an incoming FlowFile and a valid JSON Path setting, `SelectJson` will send one or more FlowFiles to the `selected` relationship, and the original FlowFile will be sent to the `original` relationship.  If the JSON Path does not match any object or array in the document, then the document will be passed to the `failure` relationship.
   
   #### JSON Path Examples
   
   Here is a sample JSON file, followed by JSON Path expressions and the content of the FlowFiles that would be output from the `SelectJson` processor.
   
   Sample JSON:
   ```
   [
     {
       "name": "Seattle",
       "weather": [
         {
           "main": "Snow",
           "description": "light snow"
         }
       ]
     },
     {
       "name": "Washington, DC",
       "weather": [
         {
           "main": "Mist",
           "description": "mist"
         },
         {
           "main": "Fog",
           "description": "fog"
         }
       ]
     }
   ]
   ```
   
   * JSON Path Expression: `$[1].weather.*`
       - FlowFile 0: `{"main":"Mist","description":"mist"}`
       - FlowFile 1: `{"main":"Fog","description":"fog"}`
   * JSON Path Expression: `$[1].name`
       - FlowFile 0: `"Washington, DC"`
   * JSON Path Expression: `$[*]['weather'][*]['main']`
       - FlowFile 0: `"Snow"`
       - FlowFile 1: `"Mist"`
       - FlowFile 2: `"Fog"`
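
The three expressions above can be checked by hand with a small Python sketch. This is only an illustration of which values each path selects: it loads the whole sample with `json.loads` (a DOM approach, so it is not how `SelectJson` or JsonSurfer work internally), and it takes the path already split into steps rather than parsing JSON Path syntax. The names `WILDCARD`, `children_of`, `select`, and `flowfile_contents` are hypothetical.

```python
import json

WILDCARD = object()  # hypothetical marker standing in for `*` and `[*]`

SAMPLE = json.loads("""
[
  {"name": "Seattle",
   "weather": [{"main": "Snow", "description": "light snow"}]},
  {"name": "Washington, DC",
   "weather": [{"main": "Mist", "description": "mist"},
               {"main": "Fog", "description": "fog"}]}
]
""")


def children_of(node, step):
    """Return the children of `node` selected by one path step."""
    if step is WILDCARD:
        if isinstance(node, dict):
            return list(node.values())
        if isinstance(node, list):
            return list(node)
        return []
    if isinstance(node, dict) and isinstance(step, str) and step in node:
        return [node[step]]
    if isinstance(node, list) and isinstance(step, int) and 0 <= step < len(node):
        return [node[step]]
    return []


def select(node, steps):
    """Yield every value reached by following `steps`, in document order."""
    if not steps:
        yield node
        return
    for child in children_of(node, steps[0]):
        yield from select(child, steps[1:])


def flowfile_contents(doc, steps):
    """Serialize each match compactly, one string per would-be FlowFile."""
    return [json.dumps(match, separators=(",", ":")) for match in select(doc, steps)]


# $[1].weather.*
print(flowfile_contents(SAMPLE, [1, "weather", WILDCARD]))
# $[1].name
print(flowfile_contents(SAMPLE, [1, "name"]))
# $[*]['weather'][*]['main']
print(flowfile_contents(SAMPLE, [WILDCARD, "weather", WILDCARD, "main"]))
```

A path that matches nothing, such as the steps `[5, "name"]`, yields an empty list here, which corresponds to the no-match case described earlier.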
   
   
   ### Checklist
   
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   - [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
   - [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
   - [x] Is your initial contribution a single, squashed commit?
   
   - [ ] Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
         (Note: `mvn clean install` completes without error after disabling `FileBasedClusterNodeFirewallTest` and `DBCPServiceTest`.
           Adding `-Pcontrib-check` fails, but it appears to fail on the `master` branch too)
   - [x] Have you written or updated unit tests to verify your changes?
   - [x] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [x] If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
   - [x] If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
   - [x] If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?
   
   - [ ] Have you ensured that format looks appropriate for the output in which it is rendered?
   
   ### See Also
   SplitLargeJson: #3414
   StreamingJsonReader: #3222
   
