Posted to user@spark.apache.org by Patrick <ti...@gmail.com> on 2017/07/15 17:41:08 UTC

Querying on Deeply Nested JSON Structures

Hi,

We need to query a deeply nested JSON structure. However, each query targets
a single field at a nested level, such as mean, median, or mode.

I am aware of the SQL explode function.

from pyspark.sql.functions import explode
df = df_nested.withColumn('exploded', explode('top'))

But this is too slow.

Is there any other strategy that could give us better performance when
querying nested JSON in a Spark Dataset?


Thanks

Re: Querying on Deeply Nested JSON Structures

Posted by Burak Yavuz <br...@gmail.com>.

Have you checked out this blog post?
https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html

It shows tools and tips for working with nested data. With JSON, you can
access nested fields through dotted paths such as `field1.field2.field3`.

Best,
Burak

On Sat, Jul 15, 2017 at 10:45 AM, Matt Deaver <ma...@gmail.com> wrote:

> I would love to be told otherwise, but I believe your options are to
> either 1) use the explode function or 2) pre-process the data so you don't
> have to explode it.

Re: Querying on Deeply Nested JSON Structures

Posted by Matt Deaver <ma...@gmail.com>.

I would love to be told otherwise, but I believe your options are to either
1) use the explode function or 2) pre-process the data so you don't have to
explode it.
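
A rough sketch of what option 2 could look like, assuming the input is
line-delimited JSON and the array to flatten is called `top` as in the
original snippet. The `id` field, the sample records, and the helper name
`flatten_record` are hypothetical. The idea is to write out one flat record
per array element ahead of time, keeping only the field that gets queried,
so the Spark job reads flat rows and never calls explode.

```python
import json


def flatten_record(record, array_key, field):
    """Return one flat dict per element of record[array_key], keeping only `field`."""
    rows = []
    for element in record.get(array_key, []):
        rows.append({"id": record.get("id"), field: element.get(field)})
    return rows


# Hypothetical input lines; a real job would stream these from a file.
raw_lines = [
    '{"id": 1, "top": [{"mean": 2.5, "mode": 2}, {"mean": 4.0, "mode": 4}]}',
    '{"id": 2, "top": []}',
]

flat_rows = [
    row
    for line in raw_lines
    for row in flatten_record(json.loads(line), "top", "mean")
]

# One flat JSON document per line, ready to be written out and loaded later.
flat_lines = [json.dumps(row) for row in flat_rows]
print(flat_lines)
```

Records with an empty `top` array produce no output rows here, which matches
what explode does; keep that in mind if those records need to be counted.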