Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/11/10 01:32:00 UTC
[jira] [Created] (DRILL-5950) Allow JSON files to be splittable -
for sequence of objects format
Paul Rogers created DRILL-5950:
----------------------------------
Summary: Allow JSON files to be splittable - for sequence of objects format
Key: DRILL-5950
URL: https://issues.apache.org/jira/browse/DRILL-5950
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.12.0
Reporter: Paul Rogers
The JSON format plugin is not currently splittable. This means that each JSON file must be read by a single thread. By contrast, text files are splittable.
The key barrier to allowing JSON files to be splittable is the lack of a good mechanism to find the start of a record at some arbitrary point in the file. Text readers handle this by scanning forward looking for (say) the newline that separates records. (Though this process can be thrown off if a newline appears in a quoted value, and the start quote appears before the split point.)
However, as was discovered in a previous JSON fix, Drill's form of JSON does provide the tools. In standard JSON, a list of records must be structured as an array:
{code}
[ { "text": "first record" },
{ "text": "second record" },
...
{ "text": "final record" }
]
{code}
In this form, it is impossible to find the start of a record without parsing from the first character onwards.
Drill, however, uses a common but non-standard JSON structure that dispenses with the enclosing array and the commas between records:
{code}
{ "text": "first record" }
{ "text": "second record" }
...
{ "text": "last record" }
{code}
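This form does allow finding the start of a record. Simply scan forward until we find the tokens }, {, in that order, possibly separated by white space. That sequence cannot occur inside a single well-formed JSON object (other than within a quoted string value, much like the quoted-newline case for text files), so it marks the boundary between records in the sequence-of-records format.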
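The boundary scan can be sketched roughly as follows. This is only an illustration of the idea, not Drill code: the class JsonSplitScanner and the method findRecordStart are hypothetical names, the whole file is assumed to be available as a UTF-8 byte array, and the sketch assumes that a closing brace followed by an opening brace never appears inside a quoted string value.

```java
/**
 * Hypothetical helper (not part of Drill): finds where the first complete
 * record begins at or after a split offset in sequence-of-objects JSON.
 */
final class JsonSplitScanner {

    private JsonSplitScanner() {}

    /** JSON insignificant white space: space, tab, line feed, carriage return. */
    private static boolean isJsonWhitespace(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    /**
     * Scans forward from splitOffset for '}' followed by optional white space
     * and then '{'. Returns the offset of that '{', or -1 if no boundary is
     * found. Assumption: the pattern does not occur inside a string value.
     * Scanning bytes is safe for UTF-8 because braces and white space are
     * ASCII and never appear inside a multi-byte sequence.
     */
    static int findRecordStart(byte[] data, int splitOffset) {
        if (splitOffset == 0) {
            return 0; // assumption: the first split starts at the first record
        }
        int i = splitOffset;
        while (i < data.length) {
            if (data[i] != '}') {
                i++;
                continue;
            }
            // Found a closing brace: skip any white space that follows it.
            int j = i + 1;
            while (j < data.length && isJsonWhitespace(data[j])) {
                j++;
            }
            if (j < data.length && data[j] == '{') {
                return j; // start of the next record
            }
            // Not a boundary (for example a nested "} }"); resume at j,
            // which may itself be another closing brace.
            i = j;
        }
        return -1;
    }
}
```

Under this sketch, a reader assigned a split would call findRecordStart with its split offset and begin parsing at the returned position; a result of -1 means no record starts within the rest of the data. How the preceding split's reader handles the record that straddles the split point is left open here.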
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)