You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2020/03/12 23:13:16 UTC

[GitHub] [drill] paul-rogers opened a new pull request #2023: DRILL-7640: EVF-based JSON Loader

paul-rogers opened a new pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023
 
 
   # [DRILL-7640](https://issues.apache.org/jira/browse/DRILL-7640): EVF-based JSON Loader
   
   ## Description
   
   Builds on the JSON structure parser and several other PRs to provide an enhanced, robust mechanism to read JSON data into value vectors via the EVF. This is not the JSON reader, rather it is the "V2" version of the JsonProcessor which does the actual JSON parsing/loading work.
   
   This PR is a partial do-over of an earlier PR, DRILL-6953. This PR contains only the lower-level JSON loader. A new PR for DRILL-6953 will follow which will add the JSON reader and fix any compatibility issues. Doing the PR separately reduces the size of each PR, making them easier to review and manage.
   
   The concept is that we already have the JSON structure parser (thanks to those that reviewed those PRs!). The structure parser emits *events* to *listeners*. This PR implements the listeners which use EVF to write values to value vectors.
   
   Some new features provided by the JSON loader include:
   
   * Built in conversion from (almost) any JSON type to (almost) any other type. If a field starts a String, and shifts to Number, Drill will continue to write the value as `VARCHAR`.
   * Allow runs of nulls (`null`) and/or empty arrays (`[ ]`). Defer type selection until the first actual value appears. (Or, force selection of `Nullable VARCHAR` if the batch ends without seeing any type.)
   * An array with null entries `[ null ]` forces type selection to `Nullable VARCHAR` since we must count the null entries.
   * Support for a provided schema. If the input is ambiguous, or inconsistent, the provided schema states the desired column type; the JSON loader converts data to that type.
   * The provided schema allows "text mode" selection per-column. The JSON loader supports the "all text mode" setting from before, but now allows "text mode" per column for only those columns which are a problem.
   * The provided schema also allows a new "JSON mode": the field, and all its children, are read as JSON text. That is, if the JSON is `{a: {b: ["foo", 10, true]}}`, and column `a` is read as JSON, then the value of `a` is `"{b: ["foo", 10, true]}"`.
   
   We can expect a number of tweaks and adjustments to be needed in a later PR when we get the existing tests to pass with the new JSON format plugin. The goal here is to simply get the bulk of the work reviewed separately.
   
   To be clear, in this PR, nothing other than unit tests uses this new code.
   
   This PR contains a few changes which duplicate those in the PR for DRILL-7633. Once DRILL-7633 is merged, this PR will be rebased and the overlapping changes will disappear.
   
   ## Documentation
   
   Once the JSON format plugin PR is submitted, users can use the provided column properties mentioned above. This PR does not yet expose them, so no documentation is needed yet.
   
   ## Testing
   
   Added unit tests for all of the cases and features described above. Also tests failure cases (such as JSON with inconsistent types which the JSON loader cannot handle.)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023#discussion_r392402158
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/loader/JsonLoaderOptions.java
 ##########
 @@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.json.loader;
+
+import org.apache.drill.exec.store.easy.json.parser.JsonStructureOptions;
+
+/**
+ * Extends the {@link JsonStructureOptions} class, which provides
+ * JSON syntactic options, with a number of semantic options enforced
+ * at the JSON loader level.
+ */
+public class JsonLoaderOptions extends JsonStructureOptions {
+
+  public boolean readNumbersAsDouble;
+
+  /**
+   * Drill prior to version 1.8 would read a null string
+   * array element as the string "null". Drill 1.8 and later
 
 Review comment:
   ```suggestion
      * array element as the string "null". Drill 1.18 and later
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023#discussion_r392384985
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/resultSet/impl/TupleState.java
 ##########
 @@ -92,6 +92,153 @@
 public abstract class TupleState extends ContainerState
   implements AbstractTupleWriter.TupleWriterListener {
 
+  /**
+   * The set of columns added via the writers: includes both projected
+   * and unprojected columns. (The writer is free to add columns that the
+   * query does not project; the result set loader creates a dummy column
+   * and dummy writer, then does not project the column to the output.)
+   */
+  protected final List<ColumnState> columns = new ArrayList<>();
+
+  /**
+   * Internal writer schema that matches the column list.
+   */
+  protected final TupleMetadata schema = new TupleSchema();
+
+  /**
+   * Metadata description of the output container (for the row) or map
+   * (for map or repeated map.)
+   * <p>
+   * Rows and maps have an output schema which may differ from the internal schema.
+   * The output schema excludes unprojected columns. It also excludes
+   * columns added in an overflow row.
+   * <p>
+   * The output schema is built slightly differently for maps inside a
+   * union vs. normal top-level (or nested) maps. Maps inside a union do
+   * not defer columns because of the muddy semantics (and infrequent use)
+   * of unions.
+   */
+  protected TupleMetadata outputSchema;
+
+  private int prevHarvestIndex = -1;
+
+  protected TupleState(LoaderInternals events,
+      ResultVectorCache vectorCache,
+      ProjectionFilter projectionSet) {
+    super(events, vectorCache, projectionSet);
+  }
+
+  protected void bindOutputSchema(TupleMetadata outputSchema) {
+    this.outputSchema = outputSchema;
+  }
+
+  /**
+   * Returns an ordered set of the columns which make up the tuple.
+   * Column order is the same as that defined by the map's schema,
+   * to allow indexed access. New columns always appear at the end
+   * of the list to preserve indexes.
+   *
+   * @return ordered list of column states for the columns within
+   * this tuple
+   */
+  public List<ColumnState> columns() { return columns; }
+
+  public TupleMetadata schema() { return writer().tupleSchema(); }
+
+  public abstract AbstractTupleWriter writer();
+
+  @Override
+  public boolean isProjected(String colName) {
+    return projectionSet.isProjected(colName);
+  }
+
+  @Override
+  public ObjectWriter addColumn(TupleWriter tupleWriter, MaterializedField column) {
+    return addColumn(tupleWriter, MetadataUtils.fromField(column));
+  }
+
+  @Override
+  public ObjectWriter addColumn(TupleWriter tupleWriter, ColumnMetadata columnSchema) {
+    return BuildFromSchema.instance().buildColumn(this, columnSchema);
+  }
+
+  @Override
+  protected void addColumn(ColumnState colState) {
+    columns.add(colState);
+  }
+
+  public boolean hasProjections() {
+    for (final ColumnState colState : columns) {
+      if (colState.isProjected()) {
+        return true;
+      }
+    }
+    return false;
+  }
+
+  @Override
+  protected Collection<ColumnState> columnStates() {
+    return columns;
+  }
+
+  protected void updateOutput(int curSchemaVersion) {
+
+    // Scan all columns
+    for (int i = 0; i < columns.size(); i++) {
+      final ColumnState colState = columns.get(i);
+
+      // Ignore unprojected columns
+      if (! colState.writer().isProjected()) {
+        continue;
+      }
+
+      // If this is a new column added since the lastoutput, then we may have
 
 Review comment:
   ```suggestion
         // If this is a new column added since the last output, then we may have
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023#discussion_r392402074
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/loader/JsonLoaderOptions.java
 ##########
 @@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.json.loader;
+
+import org.apache.drill.exec.store.easy.json.parser.JsonStructureOptions;
+
+/**
+ * Extends the {@link JsonStructureOptions} class, which provides
+ * JSON syntactic options, with a number of semantic options enforced
+ * at the JSON loader level.
+ */
+public class JsonLoaderOptions extends JsonStructureOptions {
+
+  public boolean readNumbersAsDouble;
+
+  /**
+   * Drill prior to version 1.8 would read a null string
 
 Review comment:
   ```suggestion
      * Drill prior to version 1.18 would read a null string
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023#discussion_r392390290
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/loader/JsonLoader.java
 ##########
 @@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.json.loader;
+
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+
+/**
+ * Enhanced second-generation JSON loader which takes an input
+ * source and creates a series of record batches using the
+ * {@link ResultSetLoader} abstraction.
+ */
+public interface JsonLoader {
+
+  /**
+   * Column property specific to the JSON loader.
+   * Mode for reading Varchar columns from JSON. One of:
+   * <li>
+   * <li>{@link #JSON_TYPED_MODE}: Read using normal typing rules
+   * (default).</li>
+   * <li>{@link #JSON_TEXT_MODE}: Like the JSON format plugin's
+   * "all-text mode", but for a single column. That JSON field is
+   * read as text regardless of the actual value. Applies only to
+   * scalars.</li>
+   * <li>{@link JSON_LITERAL_MODE}: Causes the field, and all its
 
 Review comment:
   ```suggestion
      * <li>{@link #JSON_LITERAL_MODE}: Causes the field, and all its
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] paul-rogers commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023#discussion_r392558379
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/loader/JsonLoaderImpl.java
 ##########
 @@ -0,0 +1,341 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.json.loader;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.drill.common.exceptions.CustomErrorContext;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
+import org.apache.drill.exec.physical.resultSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.store.easy.json.parser.ErrorFactory;
+import org.apache.drill.exec.store.easy.json.parser.JsonStructureParser;
+import org.apache.drill.exec.store.easy.json.parser.ValueDef;
+import org.apache.drill.exec.store.easy.json.parser.ValueDef.JsonType;
+import org.apache.drill.exec.vector.accessor.UnsupportedConversionError;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.esri.core.geometry.JsonReader;
+import com.fasterxml.jackson.core.JsonParseException;
+import com.fasterxml.jackson.core.JsonToken;
+
+/**
+ * Revised JSON loader that is based on the
+ * {@link ResultSetLoader} abstraction. Uses the listener-based
+ * {@link JsonStructureParser} to walk the JSON tree in a "streaming"
+ * fashion, calling events which this class turns into vector write
+ * operations. Listeners handle options such as all text mode
+ * vs. type-specific parsing. Think of this implementation as a
+ * listener-based recursive-descent parser.
+ * <p>
+ * The JSON loader mechanism runs two state machines intertwined:
+ * <ol>
+ * <li>The actual parser (to parse each JSON object, array or scalar according
+ * to its inferred type represented by the {@code JsonStructureParser}.</li>
+ * <li>The type discovery machine, which is made complex because JSON may include
+ * long runs of nulls, represented by this class.</li>
+ * </ol>
+ *
+ * <h4>Schema Discovery</h4>
+ *
+ * Fields are discovered on the fly. Types are inferred from the first JSON token
+ * for a field. Type inference is less than perfect: it cannot handle type changes
+ * such as first seeing 10, then 12.5, or first seeing "100", then 200.
+ * <p>
+ * When a field first contains null or an empty list, "null deferral" logic
+ * adds a special state that "waits" for an actual data type to present itself.
+ * This allows the parser to handle a series of nulls, empty arrays, or arrays
+ * of nulls (when using lists) at the start of the file. If no type ever appears,
+ * the loader forces the field to "text mode", hoping that the field is scalar.
+ * <p>
+ * To slightly help the null case, if the projection list shows that a column
+ * must be an array or a map, then that information is used to guess the type
+ * of a null column.
+ * <p>
+ * The code includes a prototype mechanism to provide type hints for columns.
+ * At present, it is just used to handle nulls that are never "resolved" by the
+ * end of a batch. Would be much better to use the hints (or a full schema) to
+ * avoid the huge mass of code needed to handle nulls.
+ *
+ * <h4>Provided Schema</h4>
+ *
+ * The JSON loader accepts a provided schema which removes type ambiguities.
+ * If we have the examples above (runs of nulls, or shifting types), then the
+ * provided schema says the vector type to create; the individual column listeners
+ * attempt to convert the JSON token type to the target vector type. The result
+ * is that, if the schema provides the correct type, the loader can ride over
+ * ambiguities in the input.
+ *
+ * <h4>Comparison to Original JSON Reader</h4>
+ *
+ * This class replaces the {@link JsonReader} class used in Drill versions 1.12
+ * and before. Compared with the previous version, this implementation:
+ * <ul>
+ * <li>Materializes parse states as classes rather than as methods and
+ * boolean flags as in the prior version.</li>
+ * <li>Reports errors as {@link UserException} objects, complete with context
+ * information, rather than as generic Java exception as in the prior version.</li>
+ * <li>Moves parse options into a separate {@link JsonOptions} class.</li>
+ * <li>Iteration protocol is simpler: simply call {@link #next()} until it returns
+ * {@code false}. Errors are reported out-of-band via an exception.</li>
+ * <li>The result set loader abstraction is perfectly happy with an empty schema.
+ * For this reason, this version (unlike the original) does not make up a dummy
+ * column if the schema would otherwise be empty.</li>
+ * <li>Projection pushdown is handled by the {@link ResultSetLoader} rather than
+ * the JSON loader. This class always creates a vector writer, but the result set
+ * loader will return a dummy (no-op) writer for non-projected columns.</li>
+ * <li>Like the original version, this version "free wheels" over unprojected objects
+ * and arrays; watching only for matching brackets, but ignoring all else.</li>
+ * <li>Writes boolean values as SmallInt values, rather than as bits in the
+ * prior version.</li>
+ * <li>This version also "free-wheels" over all unprojected values. If the user
+ * finds that they have inconsistent data in some field f, then the user can
+ * project fields except f; Drill will ignore the inconsistent values in f.</li>
+ * <li>Because of this free-wheeling capability, this version does not need a
+ * "counting" reader; this same reader handles the case in which no fields are
+ * projected for {@code SELECT COUNT(*)} queries.</li>
+ * <li>Runs of null values result in a "deferred null state" that patiently
+ * waits for an actual value token to appear, and only then "realizes" a parse
+ * state for that type.</li>
+ * <li>Provides the same limited error recovery as the original version. See
+ * <a href="https://issues.apache.org/jira/browse/DRILL-4653">DRILL-4653</a>
+ * and
+ * <a href="https://issues.apache.org/jira/browse/DRILL-5953">DRILL-5953</a>.
+ * </li>
+ * </ul>
+ */
+public class JsonLoaderImpl implements JsonLoader, ErrorFactory {
+  protected static final Logger logger = LoggerFactory.getLogger(JsonLoaderImpl.class);
+
+  interface NullTypeMarker {
+    void forceResolution();
+  }
+
+  private final ResultSetLoader rsLoader;
+  private final JsonLoaderOptions options;
+  private final CustomErrorContext errorContext;
+  private final TupleListener rowListener;
+  private final JsonStructureParser parser;
+  private boolean eof;
+
+  /**
+   * List of "unknown" columns (have only seen nulls or empty values)
+   * that are waiting for resolution, or forced resolution at the end
+   * of a batch. Unknown columns occur only when using dynamic type
+   * inference, and not JSON tokens have been seen which would hint
+   * at a type. Not needed when a schema is provided.
+   */
+
+  // Using a simple list. Won't perform well if we have hundreds of
+  // null fields; but then we've never seen such a pathologically bad
+  // case... Usually just one or two fields have deferred nulls.
+  private final List<NullTypeMarker> nullStates = new ArrayList<>();
+
+  public JsonLoaderImpl(ResultSetLoader rsLoader, TupleMetadata providedSchema,
+      JsonLoaderOptions options, CustomErrorContext errorContext,
+      InputStream stream) {
+    this.rsLoader = rsLoader;
+    this.options = options;
+    this.errorContext = errorContext;
+    this.rowListener = new TupleListener(this, rsLoader.writer(), providedSchema);
+    this.parser = new JsonStructureParser(stream, options, rowListener, this);
+  }
+
+  public JsonLoaderOptions options() { return options; }
+
+  @Override // JsonLoader
+  public boolean next() {
+    if (eof) {
+      return false;
+    }
+    rsLoader.startBatch();
+    RowSetLoader rowWriter = rsLoader.writer();
+    while (rowWriter.start()) {
+      if (parser.next()) {
+        rowWriter.save();
+      } else {
+        eof = true;
+        break;
+      }
+    }
+    return rsLoader.hasRows();
+  }
+
+  public void addNullMarker(JsonLoaderImpl.NullTypeMarker marker) {
+    nullStates.add(marker);
+  }
+
+  public void removeNullMarker(JsonLoaderImpl.NullTypeMarker marker) {
+    nullStates.remove(marker);
+  }
+
+  /**
+   * Finish reading a batch of data. We may have pending "null" columns:
+   * a column for which we've seen only nulls, or an array that has
+   * always been empty. The batch needs to finish, and needs a type,
+   * but we still don't know the type. Since we must decide on one,
+   * we do the following guess Varchar, and switch to text mode.
+   * <p>
+   * This choices is not perfect. Switching to text mode means
+   * results will vary
+   * from run to run depending on the order that we see empty and
+   * non-empty values for this column. Plus, since the system is
+   * distributed, the decision made here may conflict with that made in
+   * some other fragment.
+   * <p>
+   * The only real solution is for the user to provide a schema.
+   * <p>
+   * Bottom line: the user is responsible for not giving Drill
+   * ambiguous data that would require Drill to predict the future.
+   */
+  @Override // JsonLoader
+  public void endBatch() {
+    List<NullTypeMarker> copy = new ArrayList<>();
+    copy.addAll(nullStates);
+    for (NullTypeMarker marker : copy) {
+      marker.forceResolution();
+    }
 
 Review comment:
   Resolution removes items from the list, which is why we assert the list is empty at the exit of the method.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] paul-rogers commented on issue #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on issue #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023#issuecomment-599016065
 
 
   Rebased on master to resolve conflict.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] asfgit closed pull request #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023#discussion_r392401642
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/loader/JsonLoaderImpl.java
 ##########
 @@ -0,0 +1,341 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.json.loader;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.drill.common.exceptions.CustomErrorContext;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
+import org.apache.drill.exec.physical.resultSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.store.easy.json.parser.ErrorFactory;
+import org.apache.drill.exec.store.easy.json.parser.JsonStructureParser;
+import org.apache.drill.exec.store.easy.json.parser.ValueDef;
+import org.apache.drill.exec.store.easy.json.parser.ValueDef.JsonType;
+import org.apache.drill.exec.vector.accessor.UnsupportedConversionError;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.esri.core.geometry.JsonReader;
+import com.fasterxml.jackson.core.JsonParseException;
+import com.fasterxml.jackson.core.JsonToken;
+
+/**
+ * Revised JSON loader that is based on the
+ * {@link ResultSetLoader} abstraction. Uses the listener-based
+ * {@link JsonStructureParser} to walk the JSON tree in a "streaming"
+ * fashion, calling events which this class turns into vector write
+ * operations. Listeners handle options such as all text mode
+ * vs. type-specific parsing. Think of this implementation as a
+ * listener-based recursive-descent parser.
+ * <p>
+ * The JSON loader mechanism runs two state machines intertwined:
+ * <ol>
+ * <li>The actual parser (to parse each JSON object, array or scalar according
+ * to its inferred type represented by the {@code JsonStructureParser}.</li>
+ * <li>The type discovery machine, which is made complex because JSON may include
+ * long runs of nulls, represented by this class.</li>
+ * </ol>
+ *
+ * <h4>Schema Discovery</h4>
+ *
+ * Fields are discovered on the fly. Types are inferred from the first JSON token
+ * for a field. Type inference is less than perfect: it cannot handle type changes
+ * such as first seeing 10, then 12.5, or first seeing "100", then 200.
+ * <p>
+ * When a field first contains null or an empty list, "null deferral" logic
+ * adds a special state that "waits" for an actual data type to present itself.
+ * This allows the parser to handle a series of nulls, empty arrays, or arrays
+ * of nulls (when using lists) at the start of the file. If no type ever appears,
+ * the loader forces the field to "text mode", hoping that the field is scalar.
+ * <p>
+ * To slightly help the null case, if the projection list shows that a column
+ * must be an array or a map, then that information is used to guess the type
+ * of a null column.
+ * <p>
+ * The code includes a prototype mechanism to provide type hints for columns.
+ * At present, it is just used to handle nulls that are never "resolved" by the
+ * end of a batch. Would be much better to use the hints (or a full schema) to
+ * avoid the huge mass of code needed to handle nulls.
+ *
+ * <h4>Provided Schema</h4>
+ *
+ * The JSON loader accepts a provided schema which removes type ambiguities.
+ * If we have the examples above (runs of nulls, or shifting types), then the
+ * provided schema says the vector type to create; the individual column listeners
+ * attempt to convert the JSON token type to the target vector type. The result
+ * is that, if the schema provides the correct type, the loader can ride over
+ * ambiguities in the input.
+ *
+ * <h4>Comparison to Original JSON Reader</h4>
+ *
+ * This class replaces the {@link JsonReader} class used in Drill versions 1.12
+ * and before. Compared with the previous version, this implementation:
+ * <ul>
+ * <li>Materializes parse states as classes rather than as methods and
+ * boolean flags as in the prior version.</li>
+ * <li>Reports errors as {@link UserException} objects, complete with context
+ * information, rather than as generic Java exception as in the prior version.</li>
+ * <li>Moves parse options into a separate {@link JsonOptions} class.</li>
+ * <li>Iteration protocol is simpler: simply call {@link #next()} until it returns
+ * {@code false}. Errors are reported out-of-band via an exception.</li>
+ * <li>The result set loader abstraction is perfectly happy with an empty schema.
+ * For this reason, this version (unlike the original) does not make up a dummy
+ * column if the schema would otherwise be empty.</li>
+ * <li>Projection pushdown is handled by the {@link ResultSetLoader} rather than
+ * the JSON loader. This class always creates a vector writer, but the result set
+ * loader will return a dummy (no-op) writer for non-projected columns.</li>
+ * <li>Like the original version, this version "free wheels" over unprojected objects
+ * and arrays; watching only for matching brackets, but ignoring all else.</li>
+ * <li>Writes boolean values as SmallInt values, rather than as bits in the
+ * prior version.</li>
+ * <li>This version also "free-wheels" over all unprojected values. If the user
+ * finds that they have inconsistent data in some field f, then the user can
+ * project fields except f; Drill will ignore the inconsistent values in f.</li>
+ * <li>Because of this free-wheeling capability, this version does not need a
+ * "counting" reader; this same reader handles the case in which no fields are
+ * projected for {@code SELECT COUNT(*)} queries.</li>
+ * <li>Runs of null values result in a "deferred null state" that patiently
+ * waits for an actual value token to appear, and only then "realizes" a parse
+ * state for that type.</li>
+ * <li>Provides the same limited error recovery as the original version. See
+ * <a href="https://issues.apache.org/jira/browse/DRILL-4653">DRILL-4653</a>
+ * and
+ * <a href="https://issues.apache.org/jira/browse/DRILL-5953">DRILL-5953</a>.
+ * </li>
+ * </ul>
+ */
+public class JsonLoaderImpl implements JsonLoader, ErrorFactory {
+  protected static final Logger logger = LoggerFactory.getLogger(JsonLoaderImpl.class);
+
+  interface NullTypeMarker {
+    void forceResolution();
+  }
+
+  private final ResultSetLoader rsLoader;
+  private final JsonLoaderOptions options;
+  private final CustomErrorContext errorContext;
+  private final TupleListener rowListener;
+  private final JsonStructureParser parser;
+  private boolean eof;
+
+  /**
+   * List of "unknown" columns (have only seen nulls or empty values)
+   * that are waiting for resolution, or forced resolution at the end
+   * of a batch. Unknown columns occur only when using dynamic type
+   * inference, and not JSON tokens have been seen which would hint
+   * at a type. Not needed when a schema is provided.
+   */
+
+  // Using a simple list. Won't perform well if we have hundreds of
+  // null fields; but then we've never seen such a pathologically bad
+  // case... Usually just one or two fields have deferred nulls.
+  private final List<NullTypeMarker> nullStates = new ArrayList<>();
+
+  public JsonLoaderImpl(ResultSetLoader rsLoader, TupleMetadata providedSchema,
+      JsonLoaderOptions options, CustomErrorContext errorContext,
+      InputStream stream) {
+    this.rsLoader = rsLoader;
+    this.options = options;
+    this.errorContext = errorContext;
+    this.rowListener = new TupleListener(this, rsLoader.writer(), providedSchema);
+    this.parser = new JsonStructureParser(stream, options, rowListener, this);
+  }
+
+  public JsonLoaderOptions options() { return options; }
+
+  @Override // JsonLoader
+  public boolean next() {
+    if (eof) {
+      return false;
+    }
+    rsLoader.startBatch();
+    RowSetLoader rowWriter = rsLoader.writer();
+    while (rowWriter.start()) {
+      if (parser.next()) {
+        rowWriter.save();
+      } else {
+        eof = true;
+        break;
+      }
+    }
+    return rsLoader.hasRows();
+  }
+
+  public void addNullMarker(JsonLoaderImpl.NullTypeMarker marker) {
+    nullStates.add(marker);
+  }
+
+  public void removeNullMarker(JsonLoaderImpl.NullTypeMarker marker) {
+    nullStates.remove(marker);
+  }
+
+  /**
+   * Finish reading a batch of data. We may have pending "null" columns:
+   * a column for which we've seen only nulls, or an array that has
+   * always been empty. The batch needs to finish, and needs a type,
+   * but we still don't know the type. Since we must decide on one,
+   * we do the following guess Varchar, and switch to text mode.
+   * <p>
+   * This choices is not perfect. Switching to text mode means
+   * results will vary
+   * from run to run depending on the order that we see empty and
+   * non-empty values for this column. Plus, since the system is
+   * distributed, the decision made here may conflict with that made in
+   * some other fragment.
+   * <p>
+   * The only real solution is for the user to provide a schema.
+   * <p>
+   * Bottom line: the user is responsible for not giving Drill
+   * ambiguous data that would require Drill to predict the future.
+   */
+  @Override // JsonLoader
+  public void endBatch() {
+    List<NullTypeMarker> copy = new ArrayList<>();
+    copy.addAll(nullStates);
+    for (NullTypeMarker marker : copy) {
+      marker.forceResolution();
+    }
+    assert nullStates.isEmpty();
+  }
+
+  @Override // JsonLoader
+  public void close() {
+    parser.close();
+  }
+
+  @Override // ErrorFactory
+  public RuntimeException parseError(String msg, JsonParseException e) {
+    throw buildError(
+        UserException.dataReadError(e)
+          .addContext(msg));
+  }
+
+  @Override // ErrorFactory
+  public RuntimeException ioException(IOException e) {
+    throw buildError(
+        UserException.dataReadError(e)
+          .addContext(errorContext));
+  }
+
+  @Override // ErrorFactory
+  public RuntimeException structureError(String msg) {
+    throw buildError(
+        UserException.dataReadError()
+          .message(msg));
+  }
+
+  @Override // ErrorFactory
+  public RuntimeException syntaxError(JsonParseException e) {
+    throw buildError(
+        UserException.dataReadError(e)
+          .addContext("Syntax error"));
+  }
+
+  @Override // ErrorFactory
+  public RuntimeException typeError(UnsupportedConversionError e) {
+    throw buildError(
+        UserException.validationError(e)
+          .addContext("Unsupported type conversion"));
+  }
+
+  @Override // ErrorFactory
+  public RuntimeException syntaxError(JsonToken token) {
+    throw buildError(
+        UserException.dataReadError()
+          .addContext("Syntax error on token", token.toString()));
+  }
+
+  @Override // ErrorFactory
+  public RuntimeException unrecoverableError() {
+    throw buildError(
+        UserException.dataReadError()
+          .addContext("Unrecoverable syntax error on token")
+          .addContext("Recovery attempts", parser.recoverableErrorCount()));
+  }
+
+  protected UserException typeConversionError(ColumnMetadata schema, ValueDef valueDef) {
+    String type = valueDef.type().name().toLowerCase();
+    if (valueDef.isArray()) {
+      for (int i = 0; i < valueDef.dimensions(); i++) {
+        type += "[]";
 
 Review comment:
   Please use StringBuilder.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023#discussion_r392389728
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/loader/JsonLoader.java
 ##########
 @@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.json.loader;
+
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+
+/**
+ * Enhanced second-generation JSON loader which takes an input
+ * source and creates a series of record batches using the
+ * {@link ResultSetLoader} abstraction.
 
 Review comment:
   ```suggestion
    * {@link org.apache.drill.exec.physical.resultSet.ResultSetLoader} abstraction.
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023#discussion_r392390666
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/loader/JsonLoader.java
 ##########
 @@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.json.loader;
+
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+
+/**
+ * Enhanced second-generation JSON loader which takes an input
+ * source and creates a series of record batches using the
+ * {@link ResultSetLoader} abstraction.
+ */
+public interface JsonLoader {
+
+  /**
+   * Column property specific to the JSON loader.
+   * Mode for reading Varchar columns from JSON. One of:
+   * <li>
+   * <li>{@link #JSON_TYPED_MODE}: Read using normal typing rules
+   * (default).</li>
+   * <li>{@link #JSON_TEXT_MODE}: Like the JSON format plugin's
+   * "all-text mode", but for a single column. That JSON field is
+   * read as text regardless of the actual value. Applies only to
+   * scalars.</li>
+   * <li>{@link JSON_LITERAL_MODE}: Causes the field, and all its
+   * children, to be read as literal JSON: the values are returned
+   * as a valid JSON string.</li>
+   * </li>
+   */
+  String JSON_MODE = ColumnMetadata.DRILL_PROP_PREFIX + "json-mode";
+  String JSON_TEXT_MODE = "text";
+  String JSON_TYPED_MODE = "typed";
+  String JSON_LITERAL_MODE = "json";
+
+  /**
+   * Read one record of data.
+   *
+   * @return {@code true} if a record was loaded, {@code false} if EOF.
+   * @throws UserException for most errors
 
 Review comment:
   ```suggestion
      * @throws org.apache.drill.common.exceptions.UserException for most errors
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023#discussion_r392385068
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/resultSet/impl/TupleState.java
 ##########
 @@ -92,6 +92,153 @@
 public abstract class TupleState extends ContainerState
   implements AbstractTupleWriter.TupleWriterListener {
 
+  /**
+   * The set of columns added via the writers: includes both projected
+   * and unprojected columns. (The writer is free to add columns that the
+   * query does not project; the result set loader creates a dummy column
+   * and dummy writer, then does not project the column to the output.)
+   */
+  protected final List<ColumnState> columns = new ArrayList<>();
+
+  /**
+   * Internal writer schema that matches the column list.
+   */
+  protected final TupleMetadata schema = new TupleSchema();
+
+  /**
+   * Metadata description of the output container (for the row) or map
+   * (for map or repeated map.)
+   * <p>
+   * Rows and maps have an output schema which may differ from the internal schema.
+   * The output schema excludes unprojected columns. It also excludes
+   * columns added in an overflow row.
+   * <p>
+   * The output schema is built slightly differently for maps inside a
+   * union vs. normal top-level (or nested) maps. Maps inside a union do
+   * not defer columns because of the muddy semantics (and infrequent use)
+   * of unions.
+   */
+  protected TupleMetadata outputSchema;
+
+  private int prevHarvestIndex = -1;
+
+  protected TupleState(LoaderInternals events,
+      ResultVectorCache vectorCache,
+      ProjectionFilter projectionSet) {
+    super(events, vectorCache, projectionSet);
+  }
+
+  protected void bindOutputSchema(TupleMetadata outputSchema) {
+    this.outputSchema = outputSchema;
+  }
+
+  /**
+   * Returns an ordered set of the columns which make up the tuple.
+   * Column order is the same as that defined by the map's schema,
+   * to allow indexed access. New columns always appear at the end
+   * of the list to preserve indexes.
+   *
+   * @return ordered list of column states for the columns within
+   * this tuple
+   */
+  public List<ColumnState> columns() { return columns; }
+
+  public TupleMetadata schema() { return writer().tupleSchema(); }
+
+  public abstract AbstractTupleWriter writer();
+
+  @Override
+  public boolean isProjected(String colName) {
+    return projectionSet.isProjected(colName);
+  }
+
+  @Override
+  public ObjectWriter addColumn(TupleWriter tupleWriter, MaterializedField column) {
+    return addColumn(tupleWriter, MetadataUtils.fromField(column));
+  }
+
+  @Override
+  public ObjectWriter addColumn(TupleWriter tupleWriter, ColumnMetadata columnSchema) {
+    return BuildFromSchema.instance().buildColumn(this, columnSchema);
+  }
+
+  @Override
+  protected void addColumn(ColumnState colState) {
+    columns.add(colState);
+  }
+
+  public boolean hasProjections() {
+    for (final ColumnState colState : columns) {
+      if (colState.isProjected()) {
+        return true;
+      }
+    }
+    return false;
+  }
+
+  @Override
+  protected Collection<ColumnState> columnStates() {
+    return columns;
+  }
+
+  protected void updateOutput(int curSchemaVersion) {
+
+    // Scan all columns
+    for (int i = 0; i < columns.size(); i++) {
+      final ColumnState colState = columns.get(i);
+
+      // Ignore unprojected columns
+      if (! colState.writer().isProjected()) {
+        continue;
+      }
+
+      // If this is a new column added since the lastoutput, then we may have
+      // to add the column to this output. For the row itself, and for maps
+      // outside of unions, If the column wasadded after the output schema
 
 Review comment:
   ```suggestion
         // outside of unions, If the column was added after the output schema
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023#discussion_r392394571
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/loader/JsonLoaderImpl.java
 ##########
 @@ -0,0 +1,341 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.json.loader;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.drill.common.exceptions.CustomErrorContext;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
+import org.apache.drill.exec.physical.resultSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.store.easy.json.parser.ErrorFactory;
+import org.apache.drill.exec.store.easy.json.parser.JsonStructureParser;
+import org.apache.drill.exec.store.easy.json.parser.ValueDef;
+import org.apache.drill.exec.store.easy.json.parser.ValueDef.JsonType;
+import org.apache.drill.exec.vector.accessor.UnsupportedConversionError;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.esri.core.geometry.JsonReader;
+import com.fasterxml.jackson.core.JsonParseException;
+import com.fasterxml.jackson.core.JsonToken;
+
+/**
+ * Revised JSON loader that is based on the
+ * {@link ResultSetLoader} abstraction. Uses the listener-based
+ * {@link JsonStructureParser} to walk the JSON tree in a "streaming"
+ * fashion, calling events which this class turns into vector write
+ * operations. Listeners handle options such as all text mode
+ * vs. type-specific parsing. Think of this implementation as a
+ * listener-based recursive-descent parser.
+ * <p>
+ * The JSON loader mechanism runs two state machines intertwined:
+ * <ol>
+ * <li>The actual parser (to parse each JSON object, array or scalar according
+ * to its inferred type represented by the {@code JsonStructureParser}.</li>
+ * <li>The type discovery machine, which is made complex because JSON may include
+ * long runs of nulls, represented by this class.</li>
+ * </ol>
+ *
+ * <h4>Schema Discovery</h4>
+ *
+ * Fields are discovered on the fly. Types are inferred from the first JSON token
+ * for a field. Type inference is less than perfect: it cannot handle type changes
+ * such as first seeing 10, then 12.5, or first seeing "100", then 200.
+ * <p>
+ * When a field first contains null or an empty list, "null deferral" logic
+ * adds a special state that "waits" for an actual data type to present itself.
+ * This allows the parser to handle a series of nulls, empty arrays, or arrays
+ * of nulls (when using lists) at the start of the file. If no type ever appears,
+ * the loader forces the field to "text mode", hoping that the field is scalar.
+ * <p>
+ * To slightly help the null case, if the projection list shows that a column
+ * must be an array or a map, then that information is used to guess the type
+ * of a null column.
+ * <p>
+ * The code includes a prototype mechanism to provide type hints for columns.
+ * At present, it is just used to handle nulls that are never "resolved" by the
+ * end of a batch. Would be much better to use the hints (or a full schema) to
+ * avoid the huge mass of code needed to handle nulls.
+ *
+ * <h4>Provided Schema</h4>
+ *
+ * The JSON loader accepts a provided schema which removes type ambiguities.
+ * If we have the examples above (runs of nulls, or shifting types), then the
+ * provided schema says the vector type to create; the individual column listeners
+ * attempt to convert the JSON token type to the target vector type. The result
+ * is that, if the schema provides the correct type, the loader can ride over
+ * ambiguities in the input.
+ *
+ * <h4>Comparison to Original JSON Reader</h4>
+ *
+ * This class replaces the {@link JsonReader} class used in Drill versions 1.12
+ * and before. Compared with the previous version, this implementation:
+ * <ul>
+ * <li>Materializes parse states as classes rather than as methods and
+ * boolean flags as in the prior version.</li>
+ * <li>Reports errors as {@link UserException} objects, complete with context
+ * information, rather than as generic Java exception as in the prior version.</li>
+ * <li>Moves parse options into a separate {@link JsonOptions} class.</li>
+ * <li>Iteration protocol is simpler: simply call {@link #next()} until it returns
+ * {@code false}. Errors are reported out-of-band via an exception.</li>
+ * <li>The result set loader abstraction is perfectly happy with an empty schema.
+ * For this reason, this version (unlike the original) does not make up a dummy
+ * column if the schema would otherwise be empty.</li>
+ * <li>Projection pushdown is handled by the {@link ResultSetLoader} rather than
+ * the JSON loader. This class always creates a vector writer, but the result set
+ * loader will return a dummy (no-op) writer for non-projected columns.</li>
+ * <li>Like the original version, this version "free wheels" over unprojected objects
+ * and arrays; watching only for matching brackets, but ignoring all else.</li>
+ * <li>Writes boolean values as SmallInt values, rather than as bits in the
+ * prior version.</li>
+ * <li>This version also "free-wheels" over all unprojected values. If the user
+ * finds that they have inconsistent data in some field f, then the user can
+ * project fields except f; Drill will ignore the inconsistent values in f.</li>
+ * <li>Because of this free-wheeling capability, this version does not need a
+ * "counting" reader; this same reader handles the case in which no fields are
+ * projected for {@code SELECT COUNT(*)} queries.</li>
+ * <li>Runs of null values result in a "deferred null state" that patiently
+ * waits for an actual value token to appear, and only then "realizes" a parse
+ * state for that type.</li>
+ * <li>Provides the same limited error recovery as the original version. See
+ * <a href="https://issues.apache.org/jira/browse/DRILL-4653">DRILL-4653</a>
+ * and
+ * <a href="https://issues.apache.org/jira/browse/DRILL-5953">DRILL-5953</a>.
+ * </li>
+ * </ul>
+ */
+public class JsonLoaderImpl implements JsonLoader, ErrorFactory {
+  protected static final Logger logger = LoggerFactory.getLogger(JsonLoaderImpl.class);
+
+  interface NullTypeMarker {
+    void forceResolution();
+  }
+
+  private final ResultSetLoader rsLoader;
+  private final JsonLoaderOptions options;
+  private final CustomErrorContext errorContext;
+  private final TupleListener rowListener;
+  private final JsonStructureParser parser;
+  private boolean eof;
+
+  /**
+   * List of "unknown" columns (have only seen nulls or empty values)
+   * that are waiting for resolution, or forced resolution at the end
+   * of a batch. Unknown columns occur only when using dynamic type
+   * inference, and not JSON tokens have been seen which would hint
+   * at a type. Not needed when a schema is provided.
+   */
+
+  // Using a simple list. Won't perform well if we have hundreds of
+  // null fields; but then we've never seen such a pathologically bad
+  // case... Usually just one or two fields have deferred nulls.
+  private final List<NullTypeMarker> nullStates = new ArrayList<>();
+
+  public JsonLoaderImpl(ResultSetLoader rsLoader, TupleMetadata providedSchema,
+      JsonLoaderOptions options, CustomErrorContext errorContext,
+      InputStream stream) {
+    this.rsLoader = rsLoader;
+    this.options = options;
+    this.errorContext = errorContext;
+    this.rowListener = new TupleListener(this, rsLoader.writer(), providedSchema);
+    this.parser = new JsonStructureParser(stream, options, rowListener, this);
+  }
+
+  public JsonLoaderOptions options() { return options; }
+
+  @Override // JsonLoader
+  public boolean next() {
+    if (eof) {
+      return false;
+    }
+    rsLoader.startBatch();
+    RowSetLoader rowWriter = rsLoader.writer();
+    while (rowWriter.start()) {
+      if (parser.next()) {
+        rowWriter.save();
+      } else {
+        eof = true;
+        break;
+      }
+    }
+    return rsLoader.hasRows();
+  }
+
+  public void addNullMarker(JsonLoaderImpl.NullTypeMarker marker) {
+    nullStates.add(marker);
+  }
+
+  public void removeNullMarker(JsonLoaderImpl.NullTypeMarker marker) {
+    nullStates.remove(marker);
+  }
+
+  /**
+   * Finish reading a batch of data. We may have pending "null" columns:
+   * a column for which we've seen only nulls, or an array that has
+   * always been empty. The batch needs to finish, and needs a type,
+   * but we still don't know the type. Since we must decide on one,
+   * we do the following guess Varchar, and switch to text mode.
+   * <p>
+   * This choices is not perfect. Switching to text mode means
+   * results will vary
+   * from run to run depending on the order that we see empty and
+   * non-empty values for this column. Plus, since the system is
+   * distributed, the decision made here may conflict with that made in
+   * some other fragment.
+   * <p>
+   * The only real solution is for the user to provide a schema.
+   * <p>
+   * Bottom line: the user is responsible for not giving Drill
+   * ambiguous data that would require Drill to predict the future.
+   */
+  @Override // JsonLoader
+  public void endBatch() {
+    List<NullTypeMarker> copy = new ArrayList<>();
+    copy.addAll(nullStates);
+    for (NullTypeMarker marker : copy) {
+      marker.forceResolution();
+    }
 
 Review comment:
   Is it required to wrap the list?
   ```suggestion
       for (NullTypeMarker marker : nullStates) {
         marker.forceResolution();
       }
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023#discussion_r392388330
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/loader/AbstractValueListener.java
 ##########
 @@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.json.loader;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+import org.apache.drill.exec.store.easy.json.parser.ArrayListener;
+import org.apache.drill.exec.store.easy.json.parser.ObjectListener;
+import org.apache.drill.exec.store.easy.json.parser.ValueDef;
+import org.apache.drill.exec.store.easy.json.parser.ValueListener;
+
+/**
+ * Abstract base class for value (field or array element) listeners.
+ */
+public abstract class AbstractValueListener implements ValueListener {
+
+  protected final JsonLoaderImpl loader;
+
+  public AbstractValueListener(JsonLoaderImpl loader) {
+    this.loader = loader;
+  }
+
+  @Override
+  public void bind(ValueHost host) { }
+
+  @Override
+  public void onBoolean(boolean value) {
+    throw typeConversionError("Boolean");
+  }
+
+  @Override
+  public void onInt(long value) {
+    throw typeConversionError("integer");
+  }
+
+  @Override
+  public void onFloat(double value) {
+    throw typeConversionError("float");
+  }
+
+  @Override
+  public void onString(String value) {
+    throw typeConversionError("string");
+  }
+
+  @Override
+  public void onEmbedddObject(String value) {
 
 Review comment:
   ```suggestion
     public void onEmbeddedObject(String value) {
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023#discussion_r392737207
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/loader/JsonLoaderImpl.java
 ##########
 @@ -0,0 +1,341 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.json.loader;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.drill.common.exceptions.CustomErrorContext;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
+import org.apache.drill.exec.physical.resultSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.store.easy.json.parser.ErrorFactory;
+import org.apache.drill.exec.store.easy.json.parser.JsonStructureParser;
+import org.apache.drill.exec.store.easy.json.parser.ValueDef;
+import org.apache.drill.exec.store.easy.json.parser.ValueDef.JsonType;
+import org.apache.drill.exec.vector.accessor.UnsupportedConversionError;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.esri.core.geometry.JsonReader;
+import com.fasterxml.jackson.core.JsonParseException;
+import com.fasterxml.jackson.core.JsonToken;
+
+/**
+ * Revised JSON loader that is based on the
+ * {@link ResultSetLoader} abstraction. Uses the listener-based
+ * {@link JsonStructureParser} to walk the JSON tree in a "streaming"
+ * fashion, calling events which this class turns into vector write
+ * operations. Listeners handle options such as all text mode
+ * vs. type-specific parsing. Think of this implementation as a
+ * listener-based recursive-descent parser.
+ * <p>
+ * The JSON loader mechanism runs two state machines intertwined:
+ * <ol>
+ * <li>The actual parser (to parse each JSON object, array or scalar according
+ * to its inferred type represented by the {@code JsonStructureParser}.</li>
+ * <li>The type discovery machine, which is made complex because JSON may include
+ * long runs of nulls, represented by this class.</li>
+ * </ol>
+ *
+ * <h4>Schema Discovery</h4>
+ *
+ * Fields are discovered on the fly. Types are inferred from the first JSON token
+ * for a field. Type inference is less than perfect: it cannot handle type changes
+ * such as first seeing 10, then 12.5, or first seeing "100", then 200.
+ * <p>
+ * When a field first contains null or an empty list, "null deferral" logic
+ * adds a special state that "waits" for an actual data type to present itself.
+ * This allows the parser to handle a series of nulls, empty arrays, or arrays
+ * of nulls (when using lists) at the start of the file. If no type ever appears,
+ * the loader forces the field to "text mode", hoping that the field is scalar.
+ * <p>
+ * To slightly help the null case, if the projection list shows that a column
+ * must be an array or a map, then that information is used to guess the type
+ * of a null column.
+ * <p>
+ * The code includes a prototype mechanism to provide type hints for columns.
+ * At present, it is just used to handle nulls that are never "resolved" by the
+ * end of a batch. Would be much better to use the hints (or a full schema) to
+ * avoid the huge mass of code needed to handle nulls.
+ *
+ * <h4>Provided Schema</h4>
+ *
+ * The JSON loader accepts a provided schema which removes type ambiguities.
+ * If we have the examples above (runs of nulls, or shifting types), then the
+ * provided schema says the vector type to create; the individual column listeners
+ * attempt to convert the JSON token type to the target vector type. The result
+ * is that, if the schema provides the correct type, the loader can ride over
+ * ambiguities in the input.
+ *
+ * <h4>Comparison to Original JSON Reader</h4>
+ *
+ * This class replaces the {@link JsonReader} class used in Drill versions 1.12
+ * and before. Compared with the previous version, this implementation:
+ * <ul>
+ * <li>Materializes parse states as classes rather than as methods and
+ * boolean flags as in the prior version.</li>
+ * <li>Reports errors as {@link UserException} objects, complete with context
+ * information, rather than as generic Java exception as in the prior version.</li>
+ * <li>Moves parse options into a separate {@link JsonOptions} class.</li>
+ * <li>Iteration protocol is simpler: simply call {@link #next()} until it returns
+ * {@code false}. Errors are reported out-of-band via an exception.</li>
+ * <li>The result set loader abstraction is perfectly happy with an empty schema.
+ * For this reason, this version (unlike the original) does not make up a dummy
+ * column if the schema would otherwise be empty.</li>
+ * <li>Projection pushdown is handled by the {@link ResultSetLoader} rather than
+ * the JSON loader. This class always creates a vector writer, but the result set
+ * loader will return a dummy (no-op) writer for non-projected columns.</li>
+ * <li>Like the original version, this version "free wheels" over unprojected objects
+ * and arrays; watching only for matching brackets, but ignoring all else.</li>
+ * <li>Writes boolean values as SmallInt values, rather than as bits in the
+ * prior version.</li>
+ * <li>This version also "free-wheels" over all unprojected values. If the user
+ * finds that they have inconsistent data in some field f, then the user can
+ * project fields except f; Drill will ignore the inconsistent values in f.</li>
+ * <li>Because of this free-wheeling capability, this version does not need a
+ * "counting" reader; this same reader handles the case in which no fields are
+ * projected for {@code SELECT COUNT(*)} queries.</li>
+ * <li>Runs of null values result in a "deferred null state" that patiently
+ * waits for an actual value token to appear, and only then "realizes" a parse
+ * state for that type.</li>
+ * <li>Provides the same limited error recovery as the original version. See
+ * <a href="https://issues.apache.org/jira/browse/DRILL-4653">DRILL-4653</a>
+ * and
+ * <a href="https://issues.apache.org/jira/browse/DRILL-5953">DRILL-5953</a>.
+ * </li>
+ * </ul>
+ */
+public class JsonLoaderImpl implements JsonLoader, ErrorFactory {
+  protected static final Logger logger = LoggerFactory.getLogger(JsonLoaderImpl.class);
+
+  interface NullTypeMarker {
+    void forceResolution();
+  }
+
+  private final ResultSetLoader rsLoader;
+  private final JsonLoaderOptions options;
+  private final CustomErrorContext errorContext;
+  private final TupleListener rowListener;
+  private final JsonStructureParser parser;
+  private boolean eof;
+
+  /**
+   * List of "unknown" columns (have only seen nulls or empty values)
+   * that are waiting for resolution, or forced resolution at the end
+   * of a batch. Unknown columns occur only when using dynamic type
+   * inference, and not JSON tokens have been seen which would hint
+   * at a type. Not needed when a schema is provided.
+   */
+
+  // Using a simple list. Won't perform well if we have hundreds of
+  // null fields; but then we've never seen such a pathologically bad
+  // case... Usually just one or two fields have deferred nulls.
+  private final List<NullTypeMarker> nullStates = new ArrayList<>();
+
+  public JsonLoaderImpl(ResultSetLoader rsLoader, TupleMetadata providedSchema,
+      JsonLoaderOptions options, CustomErrorContext errorContext,
+      InputStream stream) {
+    this.rsLoader = rsLoader;
+    this.options = options;
+    this.errorContext = errorContext;
+    this.rowListener = new TupleListener(this, rsLoader.writer(), providedSchema);
+    this.parser = new JsonStructureParser(stream, options, rowListener, this);
+  }
+
+  public JsonLoaderOptions options() { return options; }
+
+  @Override // JsonLoader
+  public boolean next() {
+    if (eof) {
+      return false;
+    }
+    rsLoader.startBatch();
+    RowSetLoader rowWriter = rsLoader.writer();
+    while (rowWriter.start()) {
+      if (parser.next()) {
+        rowWriter.save();
+      } else {
+        eof = true;
+        break;
+      }
+    }
+    return rsLoader.hasRows();
+  }
+
+  public void addNullMarker(JsonLoaderImpl.NullTypeMarker marker) {
+    nullStates.add(marker);
+  }
+
+  public void removeNullMarker(JsonLoaderImpl.NullTypeMarker marker) {
+    nullStates.remove(marker);
+  }
+
+  /**
+   * Finish reading a batch of data. We may have pending "null" columns:
+   * a column for which we've seen only nulls, or an array that has
+   * always been empty. The batch needs to finish, and needs a type,
+   * but we still don't know the type. Since we must decide on one,
+   * we do the following guess Varchar, and switch to text mode.
+   * <p>
+   * This choices is not perfect. Switching to text mode means
+   * results will vary
+   * from run to run depending on the order that we see empty and
+   * non-empty values for this column. Plus, since the system is
+   * distributed, the decision made here may conflict with that made in
+   * some other fragment.
+   * <p>
+   * The only real solution is for the user to provide a schema.
+   * <p>
+   * Bottom line: the user is responsible for not giving Drill
+   * ambiguous data that would require Drill to predict the future.
+   */
+  @Override // JsonLoader
+  public void endBatch() {
+    List<NullTypeMarker> copy = new ArrayList<>();
+    copy.addAll(nullStates);
+    for (NullTypeMarker marker : copy) {
+      marker.forceResolution();
+    }
 
 Review comment:
   Ok, thanks for the explanation.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023#discussion_r392392054
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/loader/JsonLoaderImpl.java
 ##########
 @@ -0,0 +1,341 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.json.loader;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.drill.common.exceptions.CustomErrorContext;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
+import org.apache.drill.exec.physical.resultSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.store.easy.json.parser.ErrorFactory;
+import org.apache.drill.exec.store.easy.json.parser.JsonStructureParser;
+import org.apache.drill.exec.store.easy.json.parser.ValueDef;
+import org.apache.drill.exec.store.easy.json.parser.ValueDef.JsonType;
+import org.apache.drill.exec.vector.accessor.UnsupportedConversionError;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.esri.core.geometry.JsonReader;
+import com.fasterxml.jackson.core.JsonParseException;
+import com.fasterxml.jackson.core.JsonToken;
+
+/**
+ * Revised JSON loader that is based on the
+ * {@link ResultSetLoader} abstraction. Uses the listener-based
+ * {@link JsonStructureParser} to walk the JSON tree in a "streaming"
+ * fashion, calling events which this class turns into vector write
+ * operations. Listeners handle options such as all text mode
+ * vs. type-specific parsing. Think of this implementation as a
+ * listener-based recursive-descent parser.
+ * <p>
+ * The JSON loader mechanism runs two state machines intertwined:
+ * <ol>
+ * <li>The actual parser (to parse each JSON object, array or scalar according
+ * to its inferred type represented by the {@code JsonStructureParser}.</li>
+ * <li>The type discovery machine, which is made complex because JSON may include
+ * long runs of nulls, represented by this class.</li>
+ * </ol>
+ *
+ * <h4>Schema Discovery</h4>
+ *
+ * Fields are discovered on the fly. Types are inferred from the first JSON token
+ * for a field. Type inference is less than perfect: it cannot handle type changes
+ * such as first seeing 10, then 12.5, or first seeing "100", then 200.
+ * <p>
+ * When a field first contains null or an empty list, "null deferral" logic
+ * adds a special state that "waits" for an actual data type to present itself.
+ * This allows the parser to handle a series of nulls, empty arrays, or arrays
+ * of nulls (when using lists) at the start of the file. If no type ever appears,
+ * the loader forces the field to "text mode", hoping that the field is scalar.
+ * <p>
+ * To slightly help the null case, if the projection list shows that a column
+ * must be an array or a map, then that information is used to guess the type
+ * of a null column.
+ * <p>
+ * The code includes a prototype mechanism to provide type hints for columns.
+ * At present, it is just used to handle nulls that are never "resolved" by the
+ * end of a batch. Would be much better to use the hints (or a full schema) to
+ * avoid the huge mass of code needed to handle nulls.
+ *
+ * <h4>Provided Schema</h4>
+ *
+ * The JSON loader accepts a provided schema which removes type ambiguities.
+ * If we have the examples above (runs of nulls, or shifting types), then the
+ * provided schema says the vector type to create; the individual column listeners
+ * attempt to convert the JSON token type to the target vector type. The result
+ * is that, if the schema provides the correct type, the loader can ride over
+ * ambiguities in the input.
+ *
+ * <h4>Comparison to Original JSON Reader</h4>
+ *
+ * This class replaces the {@link JsonReader} class used in Drill versions 1.12
 
 Review comment:
   Nit:
   ```suggestion
    * This class replaces the {@link JsonReader} class used in Drill versions 1.17
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services