Posted to user@spark.apache.org by Kumaresh AK <ku...@nielsen.com> on 2019/09/12 17:03:38 UTC

Exception when reading multiline JSON file

Hello Spark Community!
I am new to Spark. I tried to read a multiline JSON file (around 2M
records; the gzipped file is about 2 GB) and hit an exception. The same
file reads fine if I convert it to jsonl (one object per line) before
reading it with Spark. Unfortunately the file is private and I cannot
share it. Is there any information here that can help me narrow this down?
I tried writing to both Parquet and JSON; both fail with the same
exception. This is the short form of the code:

    from pyspark.sql.functions import explode

    df = spark.read.option('multiline', 'true').json("file.json.gz") \
            .select(explode("objects")) \
            .select("col.*")

    df.write.mode('overwrite').json(p)
    # this crashes too: df.write.mode('overwrite').parquet(p)


Json file is of the form:
{
  "version": "xyz",
  "objects": [
    {
      "id": "abc",
      ....
    },
    {
      "id": "def",
      ....
    },
    ....
  ]
}
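
For what it is worth, this is roughly the conversion I mean by "jsonl"
(a rough sketch; the file names and the single-pass json.load are just
stand-ins for my actual script):

    # Sketch of the offline conversion step (file names are placeholders).
    # json.load pulls the whole document into memory on one machine; a
    # streaming parser such as ijson could avoid that for larger files.
    import gzip
    import json

    with gzip.open('file.json.gz', 'rt', encoding='utf-8') as src, \
            gzip.open('file.jsonl.gz', 'wt', encoding='utf-8') as dst:
        doc = json.load(src)
        for obj in doc['objects']:   # one array element per output line
            dst.write(json.dumps(obj) + '\n')

    # The converted file then reads fine without the multiline option:
    # df = spark.read.json('file.jsonl.gz')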


This is the exception:

> 2019-09-12 14:02:01,515 WARN scheduler.TaskSetManager: Lost task 0.0 in
> stage 1.0 (TID 1, 172.21.0.3, executor 0): org.apache.spark.SparkException:
> Task failed while writing rows.
>         at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
>         at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>         at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>         at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:121)
>         at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
>         at
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>         at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: Cannot grow BufferHolder by
> size 16 because the size after growing exceeds size limitation 2147483632
>         at
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:71)
>         at
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.grow(UnsafeWriter.java:62)
>         at
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.writeUnalignedBytes(UnsafeWriter.java:126)
>         at
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>         at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_0$(Unknown
> Source)
>         at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_1_6$(Unknown
> Source)
>         at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
> Source)
>         at
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1$$anonfun$apply$1.apply(FileFormat.scala:149)
>         at
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1$$anonfun$apply$1.apply(FileFormat.scala:148)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>         at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>         at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
> Source)
>         at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>         at
> scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>         at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
> Source)
>         at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>         at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:244)
>         at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
>         at
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
>         at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
>         ... 10 more
>
> 2019-09-12 14:02:01,518 INFO scheduler.TaskSetManager: Starting task 0.1
> in stage 1.0 (TID 2, 172.21.0.3, executor 0, partition 0, PROCESS_LOCAL,
> 8335 bytes)
> 2019-09-12 14:03:59,016 WARN scheduler.TaskSetManager: Lost task 0.1 in
> stage 1.0 (TID 2, 172.21.0.3, executor 0): org.apache.spark.SparkException:
> Task failed while writing rows.
>         at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
>         at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>         at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>         at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:121)
>         at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
>         at
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>         at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: Cannot grow BufferHolder by
> size 16 because the size after growing exceeds size limitation 2147483632
>         at
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:71)
>         at
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.grow(UnsafeWriter.java:62)
>         at
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.writeUnalignedBytes(UnsafeWriter.java:126)
>         at
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>         at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_0$(Unknown
> Source)
>         at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_1_6$(Unknown
> Source)
>         at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
> Source)
>         at
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1$$anonfun$apply$1.apply(FileFormat.scala:149)
>         at
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1$$anonfun$apply$1.apply(FileFormat.scala:148)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>         at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>         at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
> Source)
>         at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>         at
> scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>         at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
> Source)
>         at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>         at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:244)
>         at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
>         at
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
>         at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
>         ... 10 more
>

Re: Exception when reading multiline JSON file

Posted by Kevin Mellott <ke...@gmail.com>.
Hi Kumaresh,

This is most likely an issue with your Spark cluster not being large
enough to accomplish the task at hand. Telltale signs of this kind of
situation are stack-trace messages saying that a size limitation was
exceeded, or that a node was lost.

However, this is also a great opportunity for you to learn how to monitor
a Spark job for performance considerations. There is guidance on the Spark
website that will get you started, and you will want to dig into the Spark
cluster UI and the driver log files to see this information in action.
Databricks also makes situations like this easier to troubleshoot, because
you can execute the Spark code one step at a time in their visual notebook
experience.
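
If it helps, a minimal sketch of keeping that information around (the
event-log directory below is just an example) is to enable event logging
when you build the session, then browse the driver UI on port 4040 while
the job runs or point a history server at the same directory afterwards:

    # Sketch only: persist the event log so the job can be inspected later.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName('multiline-json-debug')
             .config('spark.eventLog.enabled', 'true')
             .config('spark.eventLog.dir', '/tmp/spark-events')  # example path
             .getOrCreate())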

Hope that helps point you in the right direction.

https://spark.apache.org/docs/latest/monitoring.html
https://m.youtube.com/watch?v=KscZf1y97m8

Kevin
