Posted to user@spark.apache.org by manjay kumar <ma...@gmail.com> on 2020/08/13 16:40:27 UTC

help on use case - spark parquet processing

Hi,

I have a use case where I need to merge three datasets and build a single
one, taking each piece of data from whichever input has it.

My dataset is a complex nested object:

Customer
- name - String
- accounts - List<Account>

Account
- type - String
- addresses - List<Address>

Address
- name - String

And so on, with further nested levels.

These files are in Parquet.

Each of the 3 input datasets holds part of the details, and these need to be
merged into one dataset that has all the information (I know which files need
to be merged).

How should I proceed with this?

- My approach is to build a case class for the actual output and parse the
three datasets into it
 (but this fails because the inputs do not have all the fields).

So basically, what should the approach be for this kind of problem?

Second, how can I convert a Parquet DataFrame to a Dataset when the Parquet
struct does not have all the fields but the case class does? (I am getting
the error "no struct type found".)

Thanks
Manjay Kumar
8320 120 839

Re: help on use case - spark parquet processing

Posted by Amit Sharma <re...@gmail.com>.
Can you use Option fields in your case class?
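
A rough sketch of what that could look like, in case it helps. The case
classes below are only a guess at the model from the structure described in
the question, and readCustomers and the path argument are hypothetical names.
The idea is to derive the full schema from the case class and pass it to the
reader, so that, as far as I know, fields absent from a given file come back
as null and map to None. Treat it as untested against the real data.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical model: every field that may be absent in some input is an Option.
case class Address(name: Option[String])
case class Account(`type`: Option[String], addresses: Option[Seq[Address]])
case class Customer(name: Option[String], accounts: Option[Seq[Account]])

object ReadWithOptionalFields {
  // Reads a Parquet path using the schema derived from Customer rather than
  // the schema stored in the file, then converts the rows to Dataset[Customer].
  def readCustomers(spark: SparkSession, path: String): Dataset[Customer] = {
    import spark.implicits._
    val fullSchema = Encoders.product[Customer].schema
    spark.read.schema(fullSchema).parquet(path).as[Customer]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-optional-fields")
      .master("local[*]") // assumption: local run for experimenting
      .getOrCreate()
    try {
      // args(0) is a placeholder for one of the three input paths.
      readCustomers(spark, args(0)).show(truncate = false)
    } finally {
      spark.stop()
    }
  }
}
```

Calling readCustomers once per input gives three Dataset[Customer] values
with the same type, which is what a later merge step would need.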


Thanks
Amit
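
On the first question (merging the three inputs wherever data is available),
one possible sketch of the merge logic in plain Scala follows. It assumes the
same hypothetical Option-based case classes, that customers match on name,
and that accounts with the same type are the same account. None of this is
stated in the thread, so adjust the keys to whatever really identifies a
record. CustomerMerge and its methods are made-up names.

```scala
// Hypothetical model: every field that may be absent in some input is an Option.
case class Address(name: Option[String])
case class Account(`type`: Option[String], addresses: Option[Seq[Address]])
case class Customer(name: Option[String], accounts: Option[Seq[Account]])

object CustomerMerge {
  // Concatenate two optional lists; the result is None only if both are missing.
  private def combine[A](x: Option[Seq[A]], y: Option[Seq[A]]): Option[Seq[A]] =
    (x, y) match {
      case (Some(l), Some(r)) => Some(l ++ r)
      case _                  => x.orElse(y)
    }

  // Assumption: two accounts with the same type are the same account.
  // Their addresses are concatenated and exact duplicates dropped.
  def mergeAccounts(a: Account, b: Account): Account =
    Account(a.`type`.orElse(b.`type`), combine(a.addresses, b.addresses).map(_.distinct))

  // Assumption: both records describe the same customer.
  // Scalar fields take the first value that is present (left side wins).
  def mergeCustomers(a: Customer, b: Customer): Customer = {
    val accounts = combine(a.accounts, b.accounts).map { all =>
      all.groupBy(_.`type`).values.map(_.reduce(mergeAccounts)).toSeq
    }
    Customer(a.name.orElse(b.name), accounts)
  }
}
```

With three Dataset[Customer] inputs, something along the lines of
ds1.union(ds2).union(ds3).groupByKey(_.name.getOrElse("")).reduceGroups(CustomerMerge.mergeCustomers _).map(_._2)
might then apply it per customer, with spark.implicits._ in scope. The order
of the merged accounts is not guaranteed.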
