You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (Jira)" <ji...@apache.org> on 2020/02/24 00:45:00 UTC
[jira] [Created] (DRILL-7598) PostgreSQL-like functions for working with JSON

Paul Rogers created DRILL-7598:
----------------------------------

             Summary: PostgreSQL-like functions for working with JSON
                 Key: DRILL-7598
                 URL: https://issues.apache.org/jira/browse/DRILL-7598
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.17.0
            Reporter: Paul Rogers


From a contributor on the Drill user mailing list:
{quote}PostgreSQL has a practical way to manipulate the json data. You can read: [https://www.postgresql.org/docs/12/datatype-json.html].

{quote}
The user's use case is as follows:
{code:json}
{"a":"horses","b":"28","c":{"c1":"black","c2":"blue"}}
{"a":"rabbit","b":"14","c":{"c1":"green" ,"c4":"vanilla"}}
{"a":"cow"  ,"b":"28","c":{"c1":"blue" ,"c3":"black" ,"c5":{"d":"2","e":"3"}}}
{code}

Notice that the {{`c`}} column changes types. This causes Drill to fail in execution. Hence the suggestion to work with column {{c}} as JSON without parsing that JSON into Drill's relational schema.

Drill should offer such support. We've recently discussed introducing a similar feature in Drill which one could, with some humor, call "let JSON be JSON." The idea would be, as in PostreSQL, to simply represent JSON as text and allow the user to work with JSON using JSON-oriented functions. The PostreSQL link suggest that this is, in fact, a workable approach (though, as you not, doing so is slower than converting JSON to a relational structure.)

Today, however, Drill attempts to map JSON into a relational model so that the user can use [SQL operations to work on the data|https://drill.apache.org/docs/json-data-model/]. The Drill approach works well when the JSON is the output of a relational model (a dump of a relational table or query, say.) The approach does not work for "native" JSON in all its complexity. JSON is a superset of the relational model and so not all JSON files map to tables and columns.

To solve the user's use case, Drill would need to adopt a solution similar to PostgreSQL. In fact, Drill already has some of the pieces (such as the [CONVERT_TO/CONVERT_FROM operations|https://drill.apache.org/docs/data-type-conversion/#convert_to-and-convert_from]), but even these attempt to convert JSON to or from the relational model. What we need, so solve the general use case, are the kind of native JSON functions which PostgreSQL provides.

Fortunately, since Drill would store JSON as a VARCHAR, no work would be needed in the Drill "core". All that is needed is someone to provide a set of Drill functions (UDFs) to call out to some JSON library to perform the desired operations.

This feature would work best when the user can parse some parts of a JSON input file into relational structure, others as JSON. (This is the use case which the user list user faced.) So, we need a way to do that. See DRILL-7597 for a request for such a feature.

Combining the PostgreSQL-like JSON functions with the ability to read selected columns as JSON, might provide an elegant solution to the "messy JSON" problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)