Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/11/08 07:10:58 UTC
[jira] [Reopened] (SPARK-7366) Support multi-line JSON objects
[ https://issues.apache.org/jira/browse/SPARK-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin reopened SPARK-7366:
--------------------------------
> Support multi-line JSON objects
> -------------------------------
>
> Key: SPARK-7366
> URL: https://issues.apache.org/jira/browse/SPARK-7366
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Reporter: Joe Halliwell
> Priority: Minor
>
> h2. Background: why the existing formats aren't enough
> The present object-per-line format for ingesting JSON data has a couple of deficiencies:
> 1. A file in this format is not itself valid JSON
> 2. It's often harder for humans to read
> The object-per-file format addresses these, but at the cost of producing many files, which can be unwieldy.
> Since it is feasible to read and write large JSON files via streaming (and many systems do), it seems reasonable to support them directly as an input format.
> h2. Suggested approach: use a depth hint
> The key challenge is to find record boundaries without parsing the file from the start, i.e. given an arbitrary offset, to locate a nearby boundary. In the general case this is impossible: you can't be sure you've identified the start of a top-level record without tracing back to the start of the file.
> However, if we know something more about the structure of the file, e.g. the maximum object depth, it seems plausible that we can do better.
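
The boundary problem described above can be illustrated with a minimal Python sketch. It is not part of the original issue and is not Spark's implementation. It assumes the records are JSON objects, either concatenated or wrapped in a single top-level array, and the function name `top_level_records` is hypothetical. Scanning from the start of the input recovers the records correctly by tracking brace depth and string state; starting the same scan at an offset inside a record mistakes a nested object for a top-level one.

```python
import json


def top_level_records(text):
    """Yield the text of each outermost JSON object, scanning from text[0].

    Assumption: records are objects ({...}). Only braces are counted, so a
    wrapping top-level array does not change the depth of its elements.
    """
    depth = 0          # current brace nesting depth
    in_string = False  # True while inside a JSON string literal
    escaped = False    # True right after a backslash inside a string
    start = None       # index where the current outermost object began
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0 and start is not None:
                yield text[start:i + 1]
                start = None


data = (
    '[\n'
    '  {"id": 1, "child": {"id": 2}},\n'
    '  {"id": 3, "note": "a } in a string"}\n'
    ']'
)

# Scanning from the start finds the two real records.
records = [json.loads(r) for r in top_level_records(data)]

# Scanning from an offset inside the first record reports the nested
# object as if it were a record, because the scanner cannot know its depth.
offset = data.index('{"id": 2}')
wrong_first = json.loads(next(top_level_records(data[offset:])))
```

In this example `records` holds the two top-level objects, while `wrong_first` is the nested `{"id": 2}` object. A depth hint of the kind suggested above would be one way to rule out such false boundaries, but how to apply it is left open by the issue text.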
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org