Posted to dev@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/11/09 22:03:00 UTC

[jira] [Created] (DRILL-5949) JSON format options should be part of plugin config; not session options

Paul Rogers created DRILL-5949:
----------------------------------

             Summary: JSON format options should be part of plugin config; not session options
                 Key: DRILL-5949
                 URL: https://issues.apache.org/jira/browse/DRILL-5949
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.12.0
            Reporter: Paul Rogers


Drill provides a JSON record reader. Drill provides two ways to configure this reader:

* Using the JSON plugin configuration.
* Using a set of session options.

The plugin configuration defines the file suffix associated with JSON files. The session options are:

* {{store.json.all_text_mode}}
* {{store.json.read_numbers_as_double}}
* {{store.json.reader.skip_invalid_records}}
* {{store.json.reader.print_skipped_invalid_record_number}}

Suppose I have two JSON files from different sources (and keep them in distinct directories). For the first, I want {{all_text_mode}} off because the data is nicely formatted. Also, my numbers are fine, so I want {{read_numbers_as_double}} off.

But, the other file is a mess and uses a rather ad-hoc format. So, I want these two options turned on.

As it turns out I often query both files. Today, I must set the session options one way to query my "clean" file, then reverse them to query the "dirty" file.
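
To make the back-and-forth concrete, here is a rough sketch of what that looks like today. The option names are the session options listed above; the {{dfs}} paths {{/data/clean}} and {{/data/dirty}} are made-up placeholders for my two directories.

```sql
-- Query the "clean" files: both options off.
ALTER SESSION SET `store.json.all_text_mode` = false;
ALTER SESSION SET `store.json.read_numbers_as_double` = false;
SELECT * FROM dfs.`/data/clean`;   -- hypothetical path

-- Query the "dirty" files: flip both options on first.
ALTER SESSION SET `store.json.all_text_mode` = true;
ALTER SESSION SET `store.json.read_numbers_as_double` = true;
SELECT * FROM dfs.`/data/dirty`;   -- hypothetical path
```

Every switch between the two sources repeats the {{ALTER SESSION}} pair, and the settings apply to the whole session rather than to one file.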

Next, I want to join the two files. How do I set the options one way for the "clean" file, and the other for the "dirty" file within the *same query*? Can't.

Now, consider the text format plugin that can read CSV, TSV, PSV and so on. It has a variety of options. But they are *not* session options; they are instead options in the plugin definition. This allows me to, say, have a plugin config for CSV-with-headers files that I get from source A, and a different plugin config for my CSV-without-headers files from source B.

Suppose we applied the text reader technique to the JSON reader. We'd move the session options listed above into the JSON format plugin config. Then, I can define one plugin config for my "clean" files, and a different plugin config for my "dirty" files.

What's more, I can then use table functions to adjust the format for each file as needed within a single query. Since table functions are part of a query, I can add them to a view that I define for the various JSON files.
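
As a sketch of what the proposal might enable, the join could look something like the query below. It borrows the existing table function syntax used for the text format ({{type => 'text'}} plus named options). The JSON option names {{allTextMode}} and {{readNumbersAsDouble}} are purely hypothetical, since the JSON format plugin does not accept them today, and the paths and column names are placeholders.

```sql
-- Hypothetical: assumes the JSON format plugin gained per-file options.
-- allTextMode / readNumbersAsDouble are illustrative names only.
SELECT c.id, d.payload
FROM table(dfs.`/data/clean`(type => 'json',
                             allTextMode => false,
                             readNumbersAsDouble => false)) c
JOIN table(dfs.`/data/dirty`(type => 'json',
                             allTextMode => true,
                             readNumbersAsDouble => true)) d
  -- In all-text mode every "dirty" column is read as VARCHAR, so cast the key.
  ON c.id = CAST(d.id AS BIGINT);
```

Wrapping each {{table(...)}} call in its own view would hide the options from anyone who queries the files.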

The result is a far simpler user experience than the tedium of resetting session options for every query.


