Posted to user@spark.apache.org by Patrick <ti...@gmail.com> on 2017/07/25 14:28:43 UTC

Re: Nested JSON Handling in Spark 2.1

Hi,

I would appreciate some suggestions on how to get top-level-struct treatment
for nested JSON when it is stored in Parquet format, or any other approach
that gives the best performance with Spark 2.1.

Thanks in advance


On Mon, Jul 24, 2017 at 4:11 PM, Patrick <ti...@gmail.com> wrote:

> To avoid confusion, the query I am referring to above is over a numeric
> element inside *a: struct (nullable = true)*.
>
> On Mon, Jul 24, 2017 at 4:04 PM, Patrick <ti...@gmail.com> wrote:
>
>> Hi,
>>
>> On reading a complex JSON file, Spark infers the following schema:
>>
>> root
>>  |-- header: struct (nullable = true)
>>  |    |-- deviceId: string (nullable = true)
>>  |    |-- sessionId: string (nullable = true)
>>  |-- payload: struct (nullable = true)
>>  |    |-- deviceObjects: array (nullable = true)
>>  |    |    |-- element: struct (containsNull = true)
>>  |    |    |    |-- additionalPayload: array (nullable = true)
>>  |    |    |    |    |-- element: struct (containsNull = true)
>>  |    |    |    |    |    |-- data: struct (nullable = true)
>>  |    |    |    |    |    |    |-- *a: struct (nullable = true)*
>>  |    |    |    |    |    |    |    |-- address: string (nullable = true)
>>
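>> For reference, the schema above was produced by a plain read along these
>> lines in spark-shell (the input path is hypothetical):
>>
>>   val df = spark.read.json("/path/to/input.json")
>>   df.printSchema()
>>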
>> When we save the above JSON to Parquet using Spark SQL, we get only two
>> top-level columns in the Parquet file: "header" and "payload".
>>
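>> The write itself is just the default Parquet save (output path
>> hypothetical):
>>
>>   df.write.parquet("/path/to/out.parquet")
>>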
>> So now we want to do a mean calculation over a numeric element inside
>> *a: struct (nullable = true)*.
>>
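>> Concretely, the kind of query we mean, sketched with explode (the numeric
>> field "value" inside *a* is hypothetical; df is the DataFrame read above):
>>
>>   import org.apache.spark.sql.functions._
>>   import spark.implicits._
>>
>>   // unnest the two array levels, then aggregate over the nested field
>>   df.select(explode($"payload.deviceObjects").as("dev"))
>>     .select(explode($"dev.additionalPayload").as("p"))
>>     .agg(avg($"p.data.a.value"))
>>     .show()
>>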
>> With reference to the Databricks blog on handling complex JSON formats:
>> https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
>>
>> *"when using Parquet, all struct columns will receive the same treatment
>> as top-level columns. Therefore, if you have filters on a nested field, you
>> will get the same benefits as a top-level column."*
>>
>> Referring to the above statement, will Parquet treat *a: struct
>> (nullable = true)* as a top-level struct column, so that a SQL query on
>> the Dataset will be optimized?
>>
>> If not, do we need to impose the schema externally by exploding the
>> complex type before writing to Parquet, in order to get the
>> top-level-column benefit? What can we do with Spark 2.1 to get the best
>> performance over such a nested structure as *a: struct (nullable = true)*?
>>
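>> By "exploding the complex type" we mean roughly the following flattening
>> before the Parquet write, so that *a* lands as a top-level struct column
>> (output path hypothetical; same imports as the sketch above):
>>
>>   // flatten header fields and unnest both array levels, keeping a intact
>>   val flat = df
>>     .select($"header.deviceId", $"header.sessionId",
>>             explode($"payload.deviceObjects").as("dev"))
>>     .select($"deviceId", $"sessionId",
>>             explode($"dev.additionalPayload").as("p"))
>>     .select($"deviceId", $"sessionId", $"p.data.a".as("a"))
>>   flat.write.parquet("/path/to/flat.parquet")
>>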
>> Thanks
>>
>>
>
>