Posted to user@spark.apache.org by Aseem Bansal <as...@gmail.com> on 2017/02/14 14:30:35 UTC

Dealing with missing columns in SPARK SQL in JSON

Say I have two files containing single rows

json1.json

{"a": 1}

json2.json

{"b": 2}

I read each JSON file into its own dataframe using Spark's API, one at a
time. So I have

Dataset json1DF
and
Dataset json2DF

If I run "select a, b from __THIS__" in a SQLTransformer then I will get an
exception, because json1DF does not have "b" and json2DF does not have "a".

How could I handle this situation with missing columns in JSON?

Re: Dealing with missing columns in SPARK SQL in JSON

Posted by Sam Elamin <hu...@gmail.com>.
Ah, if that's the case then you might need to define the schema beforehand.
Either that, or if you want to infer it, then ensure a JSON file exists with
the right schema so Spark infers the right columns.

Essentially making both files one dataframe, if that makes sense.


Re: Dealing with missing columns in SPARK SQL in JSON

Posted by Aseem Bansal <as...@gmail.com>.
Sorry if I trivialized the example. It is the same kind of file and
sometimes it could have "a", sometimes "b", sometimes both. I just don't
know. That is what I meant by missing columns.

It would be good if, when I read any of the JSON files and run Spark SQL, it
gave me

for json1.json

a    | b
1    | null

for json2.json

a    | b
null | 2



Re: Dealing with missing columns in SPARK SQL in JSON

Posted by Sam Elamin <hu...@gmail.com>.
I may be missing something super obvious here, but can't you combine them
into a single dataframe? Left join perhaps?

Try writing it in SQL, e.g. "select a from json1 and b from json2", then run
explain to give you a hint on how to do it in code.

Regards
Sam