Posted to user@spark.apache.org by Chetan Khatri <ck...@gmail.com> on 2016/10/18 07:43:58 UTC

About Error while reading large JSON file in Spark

Hello Community members,

I am getting an error while reading a large JSON file in Spark.

*Code:*

val landingVisitor =
sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")

*Error:*

16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
java.io.IOException: Too many bytes before newline: 2147483648
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at
org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)

What would be the resolution for this?

Thanks in advance!


-- 
Yours Aye,
Chetan Khatri.

Re: About Error while reading large JSON file in Spark

Posted by Steve Loughran <st...@hortonworks.com>.
On 18 Oct 2016, at 10:58, Chetan Khatri <ck...@gmail.com> wrote:

Dear Xi shen,

Thank you for getting back to my question.

The approach I am following is as below:
I have MSSQL Server as the enterprise data lake.

1. I run Java jobs that generate JSON files; every file is almost 6 GB.
Correct, Spark needs every JSON record on a separate line, so I ran
sed -e 's/}/}\n/g' -s old-file.json > new-file.json
to get every JSON element onto its own line.
2. I uploaded the files to an S3 bucket and read them from there using the sqlContext.read.json() function, which is where I get the above error.

Note: I do not get this error when running on smaller files with almost the same JSON structure.

Current approach:

  *    splitting the large JSON (6 GB) into ~1 GB files, then processing those.

Note: cluster size is 1 master and 2 slaves, each with 4 vcores and 26 GB RAM.

I see what you are trying to do here: one JSON object per line, then splitting by line so that you can parallelise the JSON processing while still holding many JSON objects in a single S3 file. This is a devious little trick. It just doesn't work once a single line goes past 2^31 bytes, because the code that splits by line breaks down at that point.

You could write your own input splitter which actually does basic JSON parsing, splitting up by looking for the final } of a JSON object (harder than you think, as you need to remember how many {} clauses you have entered, and not count a "{" that appears inside a string).
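
For what it's worth, the heart of that is a small state machine; here is a rough sketch of just the boundary detection, assuming well-formed top-level objects (all of the Hadoop InputFormat/RecordReader plumbing is left out):

// Sketch only: scan a chunk of text and report the offsets at which a
// top-level JSON object ends, tracking string literals and escapes so
// that a '{' or '}' inside a string is not counted.
def recordEnds(chunk: String): Seq[Int] = {
  var depth = 0            // how many unclosed '{' we are inside
  var inString = false     // currently inside a JSON string literal?
  var escaped = false      // previous character was a backslash
  val ends = scala.collection.mutable.ArrayBuffer[Int]()
  for ((c, i) <- chunk.zipWithIndex) {
    if (inString) {
      if (escaped) escaped = false
      else if (c == '\\') escaped = true
      else if (c == '"') inString = false
    } else c match {
      case '"' => inString = true
      case '{' => depth += 1
      case '}' => depth -= 1; if (depth == 0) ends += i  // object just closed
      case _   =>
    }
  }
  ends
}

Handling an object that straddles a split boundary is where most of the real work would go.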

A quick Google search shows some existing implementations that may be a good starting point:

https://github.com/Pivotal-Field-Engineering/pmr-common/blob/master/PivotalMRCommon/src/main/java/com/gopivotal/mapreduce/lib/input/JsonInputFormat.java
https://github.com/alexholmes/json-mapreduce


Re: About Error while reading large JSON file in Spark

Posted by Chetan Khatri <ck...@gmail.com>.
Dear Xi Shen,

Thank you for getting back to my question.

The approach I am following is as below:
I have MSSQL Server as the enterprise data lake.

1. I run Java jobs that generate JSON files; every file is almost 6 GB.
Correct, Spark needs every JSON record on a separate line, so I ran
sed -e 's/}/}\n/g' -s old-file.json > new-file.json
to get every JSON element onto its own line.
2. I uploaded the files to an S3 bucket and read them from there using the
sqlContext.read.json() function, which is where I get the above error.

Note: I do not get this error when running on smaller files with almost the
same JSON structure.

*Current approach:*

   -  splitting the large JSON (6 GB) into ~1 GB files, then processing those
      (see the sketch below).

Note: cluster size is 1 master and 2 slaves, each with 4 vcores and 26 GB RAM.
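
If the ~1 GB pieces land under one prefix, read.json accepts a glob, so they
can still be read back as a single DataFrame (the part-file naming below is
just illustrative):

val landingVisitor =
  sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug-part-*.json")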


Thanks.




On Tue, Oct 18, 2016 at 2:50 PM, Xi Shen <da...@gmail.com> wrote:

> It is a plain Java IO error. Your line is too long. You should alter your
> JSON schema, so each line is a small JSON object.
>
> Please do not concatenate all the objects into an array and then write the
> array on one line. You will have difficulty handling such a super large JSON
> array in Spark anyway.
>
> Because one array is one object, it cannot be split into multiple
> partitions.
>
>
> On Tue, Oct 18, 2016 at 3:44 PM Chetan Khatri <ck...@gmail.com>
> wrote:
>
>> Hello Community members,
>>
>> I am getting an error while reading a large JSON file in Spark.
>>
>> *Code:*
>>
>> val landingVisitor =
>>   sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")
>>
>> *Error:*
>>
>> 16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID
>> 8)
>> java.io.IOException: Too many bytes before newline: 2147483648
>> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
>> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
>> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
>>
>> What would be the resolution for this?
>>
>> Thanks in advance!
>>
>>
>> --
>> Yours Aye,
>> Chetan Khatri.
>>
>> --
>
>
> Thanks,
> David S.
>



-- 
Yours Aye,
Chetan Khatri.
M.+91 76666 80574
Data Science Researcher
INDIA

Re: About Error while reading large JSON file in Spark

Posted by Xi Shen <da...@gmail.com>.
It is a plain Java IO error. Your line is too long. You should alter your
JSON schema, so each line is a small JSON object.

Please do not concatenate all the objects into an array and then write the
array on one line. You will have difficulty handling such a super large JSON
array in Spark anyway.

Because one array is one object, it cannot be split into multiple partitions.
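
For illustration (the field names here are invented), the file should hold one
small object per line:

{"visitor_id": 1, "event": "view"}
{"visitor_id": 2, "event": "click"}

rather than one array written out on a single line:

[{"visitor_id": 1, "event": "view"}, {"visitor_id": 2, "event": "click"}]

The first layout lets the line-based reader hand each record to a different
task; the second is a single huge record that cannot be split.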


On Tue, Oct 18, 2016 at 3:44 PM Chetan Khatri <ck...@gmail.com>
wrote:

> Hello Community members,
>
> I am getting an error while reading a large JSON file in Spark.
>
> *Code:*
>
> val landingVisitor =
> sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")
>
> *Error:*
>
> 16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID
> 8)
> java.io.IOException: Too many bytes before newline: 2147483648
> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
> at
> org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
>
> What would be the resolution for this?
>
> Thanks in advance!
>
>
> --
> Yours Aye,
> Chetan Khatri.
>
> --


Thanks,
David S.

Re: About Error while reading large JSON file in Spark

Posted by Steve Loughran <st...@hortonworks.com>.
On 18 Oct 2016, at 08:43, Chetan Khatri <ck...@gmail.com> wrote:

Hello Community members,

I am getting an error while reading a large JSON file in Spark.


the underlying read code can't handle more than 2^31 bytes in a single line:

    if (bytesConsumed > Integer.MAX_VALUE) {
      throw new IOException("Too many bytes before newline: " + bytesConsumed);
    }

That's because it's trying to split the work by line, and of course, there aren't any lines to split on.

You need to move over to reading the JSON by other means, I'm afraid. At a guess, something involving SparkContext.binaryFiles() streaming the data straight into a JSON parser.
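
Something like this, perhaps. It is only a sketch: it assumes jackson-databind
is on the classpath, it parses each record into a plain java.util.Map rather
than a DataFrame row, and because binaryFiles() hands a whole file to one task,
a 6 GB file is still read by a single executor unless the files are split up
first.

import scala.collection.JavaConverters._
import com.fasterxml.jackson.databind.ObjectMapper

// One (path, stream) pair per file; the stream is fed straight into a
// streaming JSON parser, so no splitting by newline is involved.
val records = sc.binaryFiles("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")
  .flatMap { case (path, stream) =>
    val mapper = new ObjectMapper()
    val parser = mapper.getFactory.createParser(stream.open())
    // MappingIterator lazily yields one top-level JSON object at a time,
    // so back-to-back {...}{...} records work even without newlines.
    mapper.readValues(parser, classOf[java.util.Map[String, Object]]).asScala
  }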



Code:

val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")

Unrelated, but use s3a if you can. It's better, you know.
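
Switching is mostly a matter of the URL scheme plus credentials, assuming the
hadoop-aws and matching AWS SDK jars are on the classpath; a minimal sketch:

// Standard s3a credential keys from hadoop-aws; the values are placeholders.
sc.hadoopConfiguration.set("fs.s3a.access.key", "<access key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<secret key>")

val landingVisitor =
  sqlContext.read.json("s3a://hist-ngdp/lvisitor/lvisitor-01-aug.json")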


Error:

16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
java.io.IOException: Too many bytes before newline: 2147483648
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)

What would be the resolution for this?

Thanks in advance!


--
Yours Aye,
Chetan Khatri.