Posted to user@spark.apache.org by codlife <10...@qq.com> on 2016/10/15 15:09:18 UTC

Why the json file used by sparkSession.read.json must be a valid json object per line

Hi,
   I have a question about the design of spark.read.json: why is the JSON file
it expects not a standard JSON file? Can anyone explain the internal reason?
Any advice is appreciated.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-the-json-file-used-by-sparkSession-read-json-must-be-a-valid-json-object-per-line-tp27907.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

Posted by Wangjianfei <10...@qq.com>.

Yes, the design is mainly because of HDFS.

------------------
Wang Jianfei, master's student (class of 2015), Institute of Software, Chinese Academy of Sciences    Tel: 15101549787

------------------ Original message ------------------
From: "Jakob Odersky"<ja...@odersky.com>;
Sent: Thursday, October 20, 2016, 4:46 AM
To: "Hyukjin Kwon"<gu...@gmail.com>;
Cc: "Daniel Barclay"<da...@gmail.com>; "Koert Kuipers"<ko...@tresata.com>; "user @spark"<us...@spark.apache.org>; "Wangjianfei"<10...@qq.com>;
Subject: Re: Why the json file used by sparkSession.read.json must be a valid json object per line



Another reason I could imagine is that files are often read from HDFS,
which by default uses line terminators to separate records.

It is possible to implement your own HDFS delimiter finder. However,
for arbitrary JSON data, finding that delimiter would require stateful
parsing of the file and would be difficult to parallelize across a
cluster.


Re: Why the json file used by sparkSession.read.json must be a valid json object per line

Posted by Steve Loughran <st...@hortonworks.com>.

> On 19 Oct 2016, at 21:46, Jakob Odersky <ja...@odersky.com> wrote:
> 
> Another reason I could imagine is that files are often read from HDFS,
> which by default uses line terminators to separate records.
> 
> It is possible to implement your own HDFS delimiter finder. However,
> for arbitrary JSON data, finding that delimiter would require stateful
> parsing of the file and would be difficult to parallelize across a
> cluster.
> 

Good point.

If you are creating your own files containing a list of JSON documents, then you could do your own encoding: one with, say, a header for each record (say 'J'+'S'+'O'+'N' + int64 length), and split on that. You don't need to scan a record to know its length, and you can scan a large document, counting its records, simply through a sequence of skip + read(byte[8]) operations.
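
The framing described above could be sketched as follows. This is only an illustration under stated assumptions, not an existing file format: the big-endian byte order, the `>4sq` header layout, and the function names `write_records`, `count_records` and `read_records` are all hypothetical choices made for the example.

```python
import io
import json
import struct

MAGIC = b"JSON"                 # 4-byte marker: 'J'+'S'+'O'+'N'
HEADER = struct.Struct(">4sq")  # marker + signed 64-bit length (big-endian is an assumption)


def write_records(stream, records):
    """Write each record as marker + int64 length + UTF-8 encoded JSON payload."""
    for record in records:
        payload = json.dumps(record).encode("utf-8")
        stream.write(HEADER.pack(MAGIC, len(payload)))
        stream.write(payload)


def count_records(stream):
    """Count records by reading only the headers and seeking past each payload."""
    count = 0
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return count
        if len(header) < HEADER.size:
            raise ValueError("truncated header")
        magic, length = HEADER.unpack(header)
        if magic != MAGIC or length < 0:
            raise ValueError("bad record header")
        stream.seek(length, io.SEEK_CUR)  # skip the payload without parsing it
        count += 1


def read_records(stream):
    """Yield the decoded records one at a time."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return
        magic, length = HEADER.unpack(header)
        if magic != MAGIC:
            raise ValueError("bad record header")
        yield json.loads(stream.read(length).decode("utf-8"))
```

Because the length comes before the payload, a payload may contain newlines or pretty-printed JSON without confusing the reader, and counting records never requires parsing any JSON.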


Re: Why the json file used by sparkSession.read.json must be a valid json object per line

Posted by Jakob Odersky <ja...@odersky.com>.

Another reason I could imagine is that files are often read from HDFS,
which by default uses line terminators to separate records.

It is possible to implement your own HDFS delimiter finder. However,
for arbitrary JSON data, finding that delimiter would require stateful
parsing of the file and would be difficult to parallelize across a
cluster.
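
To illustrate the point about line terminators, here is a minimal sketch (not Spark's or Hadoop's actual code) of how a worker that is handed an arbitrary byte range of a line-delimited file can find its own records without knowing anything about the bytes before it. The name `records_in_split` and the convention that a record belongs to the split containing its first byte are assumptions made for this illustration. It relies on the fact that a raw newline cannot appear inside a JSON string (it must be escaped), so in a one-object-per-line file every newline is a record boundary.

```python
import json


def records_in_split(data: bytes, start: int, end: int):
    """Return the JSON records whose first byte lies in [start, end)."""
    if start == 0:
        pos = 0
    else:
        # A line starts right after a newline, so the first line that starts
        # at or after `start` follows the first newline at index >= start - 1.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1

    records = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        line = data[pos:line_end]  # may extend past `end`; it still belongs to this split
        if line.strip():
            records.append(json.loads(line))
        pos = line_end + 1
    return records
```

Each split can be processed independently, and every record is read exactly once regardless of where the split boundaries fall. With pretty-printed or otherwise arbitrary JSON there is no such byte to search for: a `{` or `}` at a given offset may sit inside a string or a nested object, which is why finding record boundaries would need stateful parsing from the start of the file.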

On Tue, Oct 18, 2016 at 4:40 PM, Hyukjin Kwon <gu...@gmail.com> wrote:
> Regarding his recent PR[1], I guess he meant multi-line JSON.
>
> As far as I know, single-line JSON also complies with the standard. I left a
> comment citing the RFC in the PR, but please let me know if I am wrong at any
> point.
>
> Thanks!
>
> [1]https://github.com/apache/spark/pull/15511

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

Posted by Hyukjin Kwon <gu...@gmail.com>.

Regarding his recent PR[1], I guess he meant multi-line JSON.

As far as I know, single-line JSON also complies with the standard. I left a
comment citing the RFC in the PR, but please let me know if I am wrong at any
point.

Thanks!

[1]https://github.com/apache/spark/pull/15511
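
On the point that single-line JSON also complies with the standard, a small sketch may help: whitespace between JSON tokens is insignificant, so the same value can be written across many lines or on one line. The sample value below is made up for the illustration, and the check uses Python's standard `json` module rather than Spark.

```python
import json

# A made-up value; the string deliberately contains a newline character.
value = {"name": "spark", "tags": ["json", "lines"], "note": "first\nsecond"}

multi_line = json.dumps(value, indent=2)  # pretty-printed across several lines
single_line = json.dumps(value)           # the same document on one line

# Both texts are valid JSON and decode to the same value.
same_value = json.loads(multi_line) == json.loads(single_line) == value

# The newline inside the string is escaped as \n in the text, so the
# single-line form contains no raw line break.
has_raw_newline = "\n" in single_line
```

So a one-object-per-line file is a sequence of standard JSON documents; what differs from a single standard JSON file is only that the file as a whole is not one document.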

On 19 Oct 2016 7:00 a.m., "Daniel Barclay" <da...@gmail.com>
wrote:

> Koert,
>
> Koert Kuipers wrote:
>
> A single JSON object would mean that, for most parsers, it needs to fit in
> memory when reading or writing.
>
> Note that codlife didn't seem to be asking about *single-object* JSON
> files, but about *standard-format* JSON files.

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

Posted by Daniel Barclay <da...@gmail.com>.

Koert,

Koert Kuipers wrote:
>
> A single JSON object would mean that, for most parsers, it needs to fit in memory when reading or writing.
>
Note that codlife didn't seem to be asking about /single-object/ JSON files, but about /standard-format/ JSON files.

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

Posted by Koert Kuipers <ko...@tresata.com>.

A single JSON object would mean that, for most parsers, it needs to fit in
memory when reading or writing.
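
A rough sketch of the memory difference, using Python's standard `json` module rather than Spark: with one object per line, a reader only ever holds one record at a time, whereas `json.load` reads and decodes the entire document before returning. The function names `iter_json_lines` and `load_single_document` are hypothetical and exist only for this example.

```python
import io
import json


def iter_json_lines(stream):
    """Yield one decoded record per non-empty line; only one record is held at a time."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def load_single_document(stream):
    """Decode a file that is one JSON document; the whole value is built in memory."""
    return json.load(stream)


# Line-delimited input can be aggregated record by record.
lines = io.StringIO('{"n": 1}\n{"n": 2}\n\n{"n": 3}\n')
total = sum(record["n"] for record in iter_json_lines(lines))

# A single top-level array has to be fully decoded before any element is available.
document = load_single_document(io.StringIO('[{"n": 1}, {"n": 2}, {"n": 3}]'))
```

Streaming parsers for a single large document do exist, but they are the exception the "most parsers" wording allows for.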
