
Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/11/10 01:32:00 UTC

[jira] [Created] (DRILL-5950) Allow JSON files to be splittable - for sequence of objects format

Paul Rogers created DRILL-5950:
----------------------------------

             Summary: Allow JSON files to be splittable - for sequence of objects format
                 Key: DRILL-5950
                 URL: https://issues.apache.org/jira/browse/DRILL-5950
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.12.0
            Reporter: Paul Rogers


The JSON format plugin is not currently splittable. This means that each JSON file must be read in its entirety by a single thread. By contrast, text files are splittable.

The key barrier to allowing JSON files to be splittable is the lack of a good mechanism to find the start of a record at some arbitrary point in the file. Text readers handle this by scanning forward looking for (say) the newline that separates records. (Though this process can be thrown off if a newline appears in a quoted value, and the start quote appears before the split point.)
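As a rough illustration of the text-file approach, the sketch below scans forward from a split point to the next newline. It is not Drill's text reader: the function name {{next_line_start}} is hypothetical, and the code assumes one record per line with no newlines inside quoted values.

```python
def next_line_start(data: bytes, split_offset: int) -> int:
    """Return the offset of the first line that starts at or after split_offset.

    Illustrative assumptions: one record per line, and no newline
    characters inside quoted values. Returns len(data) if no further
    line starts in the data.
    """
    if split_offset == 0:
        # The first split always starts at the beginning of the file.
        return 0
    # Begin one byte early so that a line starting exactly at the
    # split point is assigned to this split rather than skipped.
    newline = data.find(b"\n", split_offset - 1)
    return len(data) if newline == -1 else newline + 1
```

A reader assigned a split would start at the returned offset and keep reading until it finishes the line that straddles the end of its split.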

However, as was discovered in a previous JSON fix, Drill's form of JSON does provide the tools. In standard JSON, a list of records must be structured as an array:

{code}
[ { "text": "first record" },
  { "text": "second record" },
  ...
  { "text": "final record" }
]
{code}

In this form, it is impossible to find the start of a record without parsing from the first character onwards.

But Drill uses a common, though non-standard, JSON structure that dispenses with the array brackets and the commas between records:

{code}
{ "text": "first record" }
{ "text": "second record" }
...
{ "text": "last record" }
{code}

This form does allow finding the start of a record unambiguously. Simply scan until we find the tokens }, { (a closing brace followed by an opening brace), possibly separated by white space. That sequence is not valid JSON and occurs only between records in the sequence-of-records format.
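A minimal sketch of such a scan follows. It is illustrative only, not Drill code: the function name {{next_record_start}} is hypothetical, and the sketch assumes that a closing brace followed by an opening brace never appears inside a quoted string value. It looks for a closing brace at or after the split point, skips any white space, and accepts the position only if the next character is an opening brace.

```python
JSON_WHITESPACE = b" \t\r\n"


def next_record_start(data: bytes, split_offset: int) -> int:
    """Return the offset of the '{' that opens the first record found
    after a '}' located at or after split_offset, or -1 if none exists.

    Illustrative assumption: the byte sequence '}' <white space> '{'
    does not occur inside a quoted string value.
    """
    if split_offset == 0:
        # The first split starts at the beginning of the file.
        return 0
    pos = data.find(b"}", split_offset)
    while pos != -1:
        nxt = pos + 1
        # Skip optional white space after the closing brace.
        while nxt < len(data) and data[nxt] in JSON_WHITESPACE:
            nxt += 1
        if nxt < len(data) and data[nxt] == ord("{"):
            return nxt  # '}' then '{': a boundary between two records
        # This '}' closed a nested object or the last record; keep scanning.
        pos = data.find(b"}", pos + 1)
    return -1
```

Nested objects do not confuse the scan, because inside a record a closing brace is followed by a comma, a closing bracket, or another closing brace, never directly by an opening brace.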



