Posted to dev@spark.apache.org by "Thakrar, Jayesh" <jt...@conversantmedia.com> on 2018/11/21 14:17:23 UTC
Double pass over ORC data files even after supplying schema and setting inferSchema = false
Hi All,
We have some batch processing where we read 100s of thousands of ORC files.
What I found is that this was taking too much time, and that there was a long pause between the point where the read begins in the code and when the executors get into action.
That period is about 1.5+ hours where only the driver seems to be busy.
I have a feeling that this is due to a double pass over the data for schema inference AND validation (e.g. if one of the files has a missing field, there is an exception).
I tried providing the schema upfront as well as setting inferSchema to false, yet the same thing happens.
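For reference, here is a minimal sketch of the kind of read we are doing, with the schema supplied upfront. This is PySpark for illustration only; the field names and path are placeholders, not our actual layout:

```python
# Hypothetical sketch: reading ORC files with an explicit schema in PySpark.
# Field names and the path are placeholders, not our real data layout.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("orc-read").getOrCreate()

# Supplying the schema explicitly should make schema inference unnecessary.
schema = StructType([
    StructField("event_id", StringType(), nullable=True),
    StructField("event_ts", LongType(), nullable=True),
])

# Even with the schema given here, the driver still pauses for a long
# time before any executors start working.
df = spark.read.schema(schema).orc("/data/orc/")
```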
Is there any explanation for this and is there any way to avoid it?
Thanks,
Jayesh
Re: Double pass over ORC data files even after supplying schema and setting inferSchema = false
Posted by "Thakrar, Jayesh" <jt...@conversantmedia.com>.
Thank you for the quick reply Dongjoon.
This sounds interesting and it might be the resolution for our issue.
Let me run some tests and I will update the thread.
Thanks,
Jayesh
From: Dongjoon Hyun <do...@gmail.com>
Date: Wednesday, November 21, 2018 at 11:46 AM
To: "Thakrar, Jayesh" <jt...@conversantmedia.com>
Cc: dev <de...@spark.apache.org>
Subject: Re: Double pass over ORC data files even after supplying schema and setting inferSchema = false
Hi, Thakrar.
Which version are you using now? If it's below Spark 2.4.0, please try to use 2.4.0.
There was an improvement related to that.
https://issues.apache.org/jira/browse/SPARK-25126
Bests,
Dongjoon.
Re: Double pass over ORC data files even after supplying schema and setting inferSchema = false
Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, Thakrar.
Which version are you using now? If it's below Spark 2.4.0, please try to use 2.4.0.
There was an improvement related to that.
https://issues.apache.org/jira/browse/SPARK-25126
Bests,
Dongjoon.