Posted to issues@drill.apache.org by "Vitalii Diravka (Jira)" <ji...@apache.org> on 2022/05/05 12:07:00 UTC
[jira] [Resolved] (DRILL-8204) Allow Provided Schema for HTTP Plugin in JSON Mode

     [ https://issues.apache.org/jira/browse/DRILL-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vitalii Diravka resolved DRILL-8204.
------------------------------------
    Resolution: Fixed

> Allow Provided Schema for HTTP Plugin in JSON Mode
> --------------------------------------------------
>
>                 Key: DRILL-8204
>                 URL: https://issues.apache.org/jira/browse/DRILL-8204
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Other
>    Affects Versions: 1.20.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>             Fix For: 2.0.0
>
>
> One of the challenges of querying APIs is inconsistent data. Drill allows you to provide a schema for individual endpoints. You can do this in one of two ways: either by 
> listing the fields in `providedSchema`, or by providing a serialized TupleMetadata of the desired schema in `jsonSchema`. This is advanced functionality and should only be used by advanced Drill users.
> Schema provisioning currently supports the complex types Array and Map at any nesting level.
> ### Example Schema Provisioning:
> ```json
> "jsonOptions": {
>   "providedSchema": [
>     {
>       "fieldName": "int_field",
>       "fieldType": "bigint"
>     }, {
>       "fieldName": "jsonField",
>       "fieldType": "varchar",
>       "properties": {
>         "drill.json-mode": "json"
>       }
>     }, {
>       // Array field
>       "fieldName": "stringField",
>       "fieldType": "varchar",
>       "isArray": true
>     }, {
>       // Map field
>       "fieldName": "mapField",
>       "fieldType": "map",
>       "fields": [
>         {
>           "fieldName": "nestedField",
>           "fieldType": "int"
>         }, {
>           "fieldName": "nestedField2",
>           "fieldType": "varchar"
>         }
>       ]
>     }
>   ]
> }
> ```
> ### Example Provisioning the Schema with a JSON String
> ```json
> "jsonOptions": {
>   "jsonSchema": "{\"type\":\"tuple_schema\",\"columns\":[{\"name\":\"outer_map\",\"type\":\"STRUCT<`int_field` BIGINT, `int_array` ARRAY<BIGINT>>\",\"mode\":\"REQUIRED\"}]}"
> }
> ```
> You can print out a JSON string of a schema with the Java code below. 
> ```java
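> // Import paths are assumed from the Drill source tree; verify against your Drill version.
> import org.apache.drill.common.types.TypeProtos.MinorType;
> import org.apache.drill.exec.record.metadata.ColumnMetadata;
> import org.apache.drill.exec.record.metadata.SchemaBuilder;
> import org.apache.drill.exec.record.metadata.TupleMetadata;
> import org.apache.drill.exec.store.easy.json.loader.JsonLoader;
>
> TupleMetadata schema = new SchemaBuilder()
>     .addNullable("a", MinorType.BIGINT)
>     .addNullable("m", MinorType.VARCHAR)
>     .build();
> ColumnMetadata m = schema.metadata("m");
> m.setProperty(JsonLoader.JSON_MODE, JsonLoader.JSON_LITERAL_MODE);
> System.out.println(schema.jsonString());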
> ```
> This will generate something like the JSON string below:
> ```json
> {
>   "type":"tuple_schema",
>   "columns":[
>     {"name":"a","type":"BIGINT","mode":"OPTIONAL"},
>     {"name":"m","type":"VARCHAR","mode":"OPTIONAL","properties":{"drill.json-mode":"json"}}
>   ]
> }
> ```
> ## Dealing With Inconsistent Schemas
> One of the major challenges of working with JSON data is an inconsistent schema. Drill has a `UNION` data type, which is marked as experimental. At the time of
> writing, the HTTP plugin does not support the `UNION` type; however, supplying a schema can resolve many of these issues.
> ### JSON Mode
> Drill offers the option of reading all JSON values as strings. While this can complicate downstream analytics, it can also be a more memory-efficient way of reading data with 
> an inconsistent schema. Unfortunately, at the time of writing, JSON mode is only available with a provided schema. However, future work will allow this mode to be enabled for 
> any JSON data.
> #### Enabling JSON Mode:
> You can enable JSON mode simply by adding the `drill.json-mode` property with a value of `json` to a field, as shown below:
> ```json
> {
>   "fieldName": "jsonField",
>   "fieldType": "varchar",
>   "properties": {
>     "drill.json-mode": "json"
>   }
> }
> ```
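
A note on the `jsonSchema` option quoted above: its value is the serialized schema embedded as a JSON string, so every inner double quote and backslash has to be escaped. The sketch below shows one way to do that escaping with plain Java and no Drill dependency. `JsonSchemaEscaper` and `escapeForJson` are hypothetical names introduced for illustration, and the sample schema string is just a shortened stand-in for the output of `schema.jsonString()`.

```java
public class JsonSchemaEscaper {

    // Wraps raw text in double quotes and escapes it as a JSON string value.
    // Hypothetical helper for illustration; it is not part of Drill.
    static String escapeForJson(String raw) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : raw.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by schema.jsonString().
        String serialized = "{\"type\":\"tuple_schema\",\"columns\":["
            + "{\"name\":\"a\",\"type\":\"BIGINT\",\"mode\":\"OPTIONAL\"}]}";
        // Prints a line that can be pasted into the jsonOptions block.
        System.out.println("\"jsonSchema\": " + escapeForJson(serialized));
    }
}
```

Backticks and angle brackets, as in the `STRUCT<...>` type above, need no escaping in JSON, so they pass through unchanged. If a JSON library is already on the classpath, its own string serializer is likely a better choice than a hand-written escaper.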


