Posted to user@spark.apache.org by "Bode, Meikel, NMA-CFD" <Me...@Bertelsmann.de> on 2021/01/29 03:19:04 UTC

Strange behavior with "bigger" JSON file

Hi all,

I process a lot of JSON files of different sizes. All files share the same overall structure. Files of around 150-300 MB cause no issues.
However, a file of around 530 MB now fails when I apply selectExpr to the DataFrame I get from reading it:

AnalysisException: cannot resolve '`entity`' given input columns: [_corrupt_record]; line 1 pos 6;
'Project ['LOWER('entity) AS entity#668391, 'extraction_model, 'part, 'pipeline_run_id, 'timestamp]
+- Relation[_corrupt_record#668389] json

The schema of the read DF looks like:

root
 |-- _corrupt_record: string (nullable = true)

I analyzed the file and it contains all requested columns. I can also parse it completely using ijson, which I tried because I thought the issue might be related to syntax errors.
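One further check that could help narrow this down, sketched here under assumptions and not something that was run for this post: by default Spark's JSON reader expects one complete JSON value per line, so a file can be valid JSON as a whole (and parse fine with ijson) and still end up in _corrupt_record if its layout is not line-delimited. The helper name count_unparseable_lines and its parameters are made up for illustration, and it uses only the standard json module. If the whole document sits on a single line, that line is loaded into memory at once.

```python
import json


def count_unparseable_lines(path, max_report=5):
    """Check whether a file is line-delimited JSON.

    Returns (total, bad, sample): the number of non-blank lines, the number
    of those that are not valid JSON on their own, and up to `max_report`
    (line_number, error_message) pairs for the failing lines.
    """
    total = 0
    bad = 0
    sample = []
    with open(path, "r", encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # blank lines are skipped, not counted
            total += 1
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                bad += 1
                if len(sample) < max_report:
                    sample.append((lineno, str(exc)))
    return total, bad, sample
```

If every non-blank line fails, the file is most likely a single multi-line document, and reading it with the reader's multiLine option set to true would be the thing to try. If only a few lines fail, the reported line numbers point at the records to inspect.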

Any hints on the _corrupt_record?

Thanks
Meikel