
Posted to user@spark.apache.org by Wangjianfei <10...@qq.com> on 2016/10/20 00:44:37 UTC

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

Yeah, the design is mainly because of HDFS.
 
------------------
Wang Jianfei, master's student (class of 2015), Institute of Software, Chinese Academy of Sciences    Tel: 15101549787

------------------ Original Message ------------------
From: "Jakob Odersky"<ja...@odersky.com>; 
Sent: Thursday, October 20, 2016, 4:46 AM
To: "Hyukjin Kwon"<gu...@gmail.com>; 
Cc: "Daniel Barclay"<da...@gmail.com>; "Koert Kuipers"<ko...@tresata.com>; "user @spark"<us...@spark.apache.org>; "Wangjianfei"<10...@qq.com>; 
Subject: Re: Why the json file used by sparkSession.read.json must be a valid json object per line



Another reason I could imagine is that files are often read from HDFS,
which by default uses line terminators to separate records.

It is possible to implement your own HDFS delimiter finder. However,
for arbitrary JSON data, finding that delimiter would require stateful
parsing of the file and would be difficult to parallelize across a
cluster.
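
To make the point about delimiters concrete, here is a minimal sketch in
plain Python (not Spark or Hadoop code) of why one-record-per-line input is
easy to split. The function name read_split and the sample data are
hypothetical, and the ownership rule (a record belongs to the split in which
its first byte falls) is an assumption chosen for illustration. A reader
handed an arbitrary byte range only has to scan forward to the next newline
to find a record boundary, so no state from earlier in the file is needed.

```python
import json


def read_split(data: bytes, start: int, end: int) -> list:
    """Parse the records owned by the byte range [start, end).

    Hypothetical helper: a record is owned by the split in which its
    first byte falls, so every record is read exactly once.
    """
    pos = start
    # If the range begins mid-record, skip to the byte after the next newline.
    if start > 0 and data[start - 1:start] != b"\n":
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1

    records = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        line_end = len(data) if nl == -1 else nl
        line = data[pos:line_end]
        if line.strip():
            # A record that starts inside the range is read to its end,
            # even if it runs past `end`.
            records.append(json.loads(line))
        pos = line_end + 1
    return records


# Sample data: one JSON object per line.
data = b'{"id": 1}\n{"id": 2, "tags": ["a", "b"]}\n{"id": 3}\n'
mid = len(data) // 2  # lands in the middle of the second record

first = read_split(data, 0, mid)
second = read_split(data, mid, len(data))
all_records = first + second
```

This works because a raw newline can never appear inside a JSON string (it
has to be escaped), so in one-object-per-line input every newline is a record
boundary. In a pretty-printed multi-line document, a brace or newline seen
from the middle of the file could sit inside a string or a nested object, and
deciding which requires parsing from the beginning.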

On Tue, Oct 18, 2016 at 4:40 PM, Hyukjin Kwon <gu...@gmail.com> wrote:
> Regarding his recent PR[1], I guess he meant multiple line json.
>
> As far as I know, single line json also complies with the standard. I left a
> comment with RFC in the PR but please let me know if I am wrong at any
> point.
>
> Thanks!
>
> [1]https://github.com/apache/spark/pull/15511
>
>
> On 19 Oct 2016 7:00 a.m., "Daniel Barclay" <da...@gmail.com>
> wrote:
>>
>> Koert,
>>
>> Koert Kuipers wrote:
>>
>> A single json object would mean for most parsers it needs to fit in memory
>> when reading or writing
>>
>> Note that codlife didn't seem to be asking about single-object JSON
>> files, but about standard-format JSON files.
>>
>>
>> On Oct 15, 2016 11:09, "codlife" <10...@qq.com> wrote:
>>>
>>> Hi:
>>>    I have a question about the design of spark.read.json: why is the
>>> json file not a standard json file? Can anyone tell me the internal
>>> reason? Any advice is appreciated.
>>>
>>
>