You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2022/03/13 21:34:04 UTC

[GitHub] [drill] cgivre opened a new pull request #2494: DRILL-8167: Add JSON Config Options to Format Config

cgivre opened a new pull request #2494:
URL: https://github.com/apache/drill/pull/2494


   # [DRILL-8167](https://issues.apache.org/jira/browse/DRILL-8167): Add JSON Config Options to Format Config
   
   ## Description
   Most all Drill format plugins allow the user to configure various options for that plugin as part of the format config.  The one glaring exception is the JSON reader which has several configuration options which can only be set globally.  This PR moves these to the format config so that users can set these options when they configure a storage plugin.  
   
   This PR does not eliminate the global settings for JSON.  It simply adds another place where a user can update the settings.  If the settings in the config file are not defined (`null`) Drill will use the global settings.
   
   The config is set to only include these values when they are not `null`, so there are no breaking changes.
   
   ## Documentation
   Drill's JSON reader can be configured with various global configuration variables.  However these variables can also be overridden in an individual storage plugin configuration.  The parameters are:
   
   * `allTextMode`:  When `true`, Drill will read all fields in a given JSON file as text.
   * `readNumbersAsDouble`:  When `true`, Drill will read all numbers as Doubles.  This is useful if your data contains fields with a mix of integers and floating point numbers.  A very common place this happens is when the record contains `0`.
   * `skipMalformedJSONRecords`:  When set to `true`, Drill will attempt to skip malformed JSON records.  When `false`, Drill will throw an exception for bad records.
   * `escapeAnyChar`:  Allows escaping of any character when set to `true`. 
   * `nanInf`:  Allows `NaN` and `Infinity` in JSON data when set to `true`. 
   
   A JSON config could look like this:
   
   ```json
   ...
   "json": {
      "type": "json",
      "extensions": ["json"],
      "allTextMode": true,
      "readNumbersAsDouble": true,
      "skipMalformedJSONRecords": true,
      "escapeAnyChar": false,
      "nanInf": true
   }
   ...
   ```
   
   You can also include these values at query time:
   
   ```sql
   SELECT `integer`, `float` 
   FROM table(cp.`jsoninput/input2.json` (type => 'json', allTextMode => True))"
   ```
   
   ## Testing
   Added unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton commented on a change in pull request #2494: DRILL-8167: Add JSON Config Options to Format Config

Posted by GitBox <gi...@apache.org>.
jnturton commented on a change in pull request #2494:
URL: https://github.com/apache/drill/pull/2494#discussion_r825601125



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONRecordReader.java
##########
@@ -124,19 +139,76 @@ private JSONRecordReader(FragmentContext fragmentContext, Path inputPath, JsonNo
       this.embeddedContent = embeddedContent;
     }
 
+    // If the config is null, create a temporary one with the global options.
+    if (config == null) {
+      this.config = new JSONFormatConfig(null,
+        embeddedContent == null && fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_ALL_TEXT_MODE_VALIDATOR),
+        embeddedContent == null && fragmentContext.getOptions().getOption(ExecConstants.JSON_READ_NUMBERS_AS_DOUBLE_VALIDATOR),
+        fragmentContext.getOptions().getOption(ExecConstants.JSON_SKIP_MALFORMED_RECORDS_VALIDATOR),
+        fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_ESCAPE_ANY_CHAR_VALIDATOR),
+        fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_NAN_INF_NUMBERS_VALIDATOR));
+    } else {
+      this.config = config;
+    }
+
     this.fileSystem = fileSystem;
     this.fragmentContext = fragmentContext;
-    // only enable all text mode if we aren't using embedded content mode.
-    this.enableAllTextMode = embeddedContent == null && fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_ALL_TEXT_MODE_VALIDATOR);
-    this.enableNanInf = fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_NAN_INF_NUMBERS_VALIDATOR);
-    this.enableEscapeAnyChar = fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_ESCAPE_ANY_CHAR_VALIDATOR);
-    this.readNumbersAsDouble = embeddedContent == null && fragmentContext.getOptions().getOption(ExecConstants.JSON_READ_NUMBERS_AS_DOUBLE_VALIDATOR);
+
+    this.enableAllTextMode = allTextMode();
+    this.enableNanInf = nanInf();
+    this.enableEscapeAnyChar = escapeAnyChar();
+    this.readNumbersAsDouble = readNumbersAsDouble();
     this.unionEnabled = embeddedContent == null && fragmentContext.getOptions().getBoolean(ExecConstants.ENABLE_UNION_TYPE_KEY);
-    this.skipMalformedJSONRecords = fragmentContext.getOptions().getOption(ExecConstants.JSON_SKIP_MALFORMED_RECORDS_VALIDATOR);
+    this.skipMalformedJSONRecords = skipMalformedJSONRecords();
     this.printSkippedMalformedJSONRecordLineNumber = fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_PRINT_INVALID_RECORDS_LINE_NOS_FLAG_VALIDATOR);
     setColumns(columns);
   }
 
+  /**
+   * Returns the value of the all text mode.  Values set in the format config will override global values.
+   * @return The value of allTextMode
+   */
+  private boolean allTextMode() {
+    // only enable all text mode if we aren't using embedded content mode.
+    if (config.getAllTextMode() == null) {
+      return embeddedContent == null && fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_ALL_TEXT_MODE_VALIDATOR);

Review comment:
       Please ensure that we preserve the existing option scope precedence: SYSTEM < FORMAT < SESSION < QUERY.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2494: DRILL-8167: Add JSON Config Options to Format Config

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2494:
URL: https://github.com/apache/drill/pull/2494#discussion_r826328608



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONRecordReader.java
##########
@@ -124,19 +139,76 @@ private JSONRecordReader(FragmentContext fragmentContext, Path inputPath, JsonNo
       this.embeddedContent = embeddedContent;
     }
 
+    // If the config is null, create a temporary one with the global options.
+    if (config == null) {
+      this.config = new JSONFormatConfig(null,
+        embeddedContent == null && fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_ALL_TEXT_MODE_VALIDATOR),
+        embeddedContent == null && fragmentContext.getOptions().getOption(ExecConstants.JSON_READ_NUMBERS_AS_DOUBLE_VALIDATOR),
+        fragmentContext.getOptions().getOption(ExecConstants.JSON_SKIP_MALFORMED_RECORDS_VALIDATOR),
+        fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_ESCAPE_ANY_CHAR_VALIDATOR),
+        fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_NAN_INF_NUMBERS_VALIDATOR));
+    } else {
+      this.config = config;
+    }
+
     this.fileSystem = fileSystem;
     this.fragmentContext = fragmentContext;
-    // only enable all text mode if we aren't using embedded content mode.
-    this.enableAllTextMode = embeddedContent == null && fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_ALL_TEXT_MODE_VALIDATOR);
-    this.enableNanInf = fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_NAN_INF_NUMBERS_VALIDATOR);
-    this.enableEscapeAnyChar = fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_ESCAPE_ANY_CHAR_VALIDATOR);
-    this.readNumbersAsDouble = embeddedContent == null && fragmentContext.getOptions().getOption(ExecConstants.JSON_READ_NUMBERS_AS_DOUBLE_VALIDATOR);
+
+    this.enableAllTextMode = allTextMode();
+    this.enableNanInf = nanInf();
+    this.enableEscapeAnyChar = escapeAnyChar();
+    this.readNumbersAsDouble = readNumbersAsDouble();
     this.unionEnabled = embeddedContent == null && fragmentContext.getOptions().getBoolean(ExecConstants.ENABLE_UNION_TYPE_KEY);
-    this.skipMalformedJSONRecords = fragmentContext.getOptions().getOption(ExecConstants.JSON_SKIP_MALFORMED_RECORDS_VALIDATOR);
+    this.skipMalformedJSONRecords = skipMalformedJSONRecords();
     this.printSkippedMalformedJSONRecordLineNumber = fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_PRINT_INVALID_RECORDS_LINE_NOS_FLAG_VALIDATOR);
     setColumns(columns);
   }
 
+  /**
+   * Returns the value of the all text mode.  Values set in the format config will override global values.
+   * @return The value of allTextMode
+   */
+  private boolean allTextMode() {
+    // only enable all text mode if we aren't using embedded content mode.
+    if (config.getAllTextMode() == null) {
+      return embeddedContent == null && fragmentContext.getOptions().getOption(ExecConstants.JSON_READER_ALL_TEXT_MODE_VALIDATOR);

Review comment:
       @jnturton 
   Thanks for the comments.  I followed the pattern you suggested and now the precedence is SYSTEM < FORMAT < SESSION.  One thing to note is that I don't think you can override the session scope.  Drill doesn't seem to have the distinction between format and query.  But... this follows the pattern from parquet so at least it is consistent.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre merged pull request #2494: DRILL-8167: Add JSON Config Options to Format Config

Posted by GitBox <gi...@apache.org>.
cgivre merged pull request #2494:
URL: https://github.com/apache/drill/pull/2494


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org