Posted to user@spark.apache.org by Aditya Borde <bo...@gmail.com> on 2017/03/21 14:53:12 UTC

Merging Schema while reading Parquet files

Hello,

I'm currently blocked with this issue:

I have a job "A" whose output is partitioned by one of its fields, "col1".
Job "B" then reads the output of job "A".

Here is the problem: the output of job "A" was not previously partitioned
by "col1" (this is a recent change), so all of my older data lacks that
partition level.
When I run job "B" over both the older and the current data, it fails
with: "inconsistent partition column names"

*The read path is something like "file://path1/name/sample/"*, and beneath
it the older data sits in directories such as
*"day=2017-02-15/filling=5/xyz1"*

Job "A" now writes its output one directory level deeper:
*"/day=2017-02-15/filling=5/col1/xyz2"*

"mergeSchema" does not help here, because my base path contains multiple
directories, at different depths, under which the files reside.
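
To illustrate why the mixed layout is rejected, here is a simplified model
of directory-based partition discovery. It is only a sketch, not Spark's
actual implementation. The base path "/data/sample" and the value in
"col1=a" are hypothetical, and it assumes the new directories follow the
usual key=value naming. As far as I understand, "mergeSchema" reconciles
the column schemas stored in the Parquet files themselves, not the
directory layout, which would explain why it has no effect on this error.

```python
# Simplified model of partition discovery (NOT Spark's real code).
def partition_columns(leaf_dir, base):
    """Return the names of the key=value segments between base and leaf_dir."""
    rel = leaf_dir[len(base):].strip("/")
    return [seg.split("=", 1)[0] for seg in rel.split("/") if "=" in seg]

base = "/data/sample"  # hypothetical base path
leaf_dirs = [
    base + "/day=2017-02-15/filling=5",         # older layout
    base + "/day=2017-02-16/filling=5/col1=a",  # newer layout (hypothetical value)
]

# Every leaf directory under one base path has to expose the same list of
# partition column names; here two different lists show up.
layouts = {tuple(partition_columns(d, base)) for d in leaf_dirs}
consistent = len(layouts) == 1
print(sorted(layouts), consistent)
```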

Can someone suggest an effective solution?

Regards,
Aditya Borde

Re: Merging Schema while reading Parquet files

Posted by Matt Deaver <ma...@gmail.com>.

You could create a one-time job that processes the historical data to match
the updated format.
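
A rough sketch of what such a one-time job might look like in PySpark
follows. Several things here are assumptions rather than facts from the
thread: the data is on a local filesystem (so os.walk can see it), data
files end in ".parquet", "col1" exists as a regular column in the older
files, and the function names and the staging directory are made up for
illustration.

```python
import os

def find_old_layout_dirs(base, partition_col="col1"):
    """Yield directories under `base` that directly contain Parquet files
    but have no `<partition_col>=` segment in their path (older layout)."""
    marker = partition_col + "="
    for root, _dirs, files in os.walk(base):
        has_parquet = any(f.endswith(".parquet") for f in files)
        rel_parts = os.path.relpath(root, base).split(os.sep)
        if has_parquet and not any(p.startswith(marker) for p in rel_parts):
            yield root

def backfill(spark, base, staging_base, partition_col="col1"):
    """Rewrite each older-layout directory into a mirrored path under
    `staging_base`, partitioned by `partition_col`.

    `spark` is an existing SparkSession supplied by the caller."""
    for old_dir in find_old_layout_dirs(base, partition_col):
        # Mirror day=.../filling=... so the new level lands beneath it.
        target = os.path.join(staging_base, os.path.relpath(old_dir, base))
        df = spark.read.parquet(old_dir)
        df.write.mode("overwrite").partitionBy(partition_col).parquet(target)
```

The output goes to a separate staging location so the originals stay
untouched until the rewritten data has been checked. Once the staged
directories replace the old ones, every leaf directory has the same
partition levels and job "B" should be able to read the base path as a
whole.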

-- 
Regards,

Matt
Data Engineer
https://www.linkedin.com/in/mdeaver
http://mattdeav.pythonanywhere.com/