Posted to user@spark.apache.org by Gourav Sengupta <go...@gmail.com> on 2016/01/19 12:24:44 UTC

storing query object

Hi,

I have a Spark table (created from hiveContext) with a couple of hundred
partitions and a few thousand files.

When I run a query on the table, Spark spends a lot of time (as seen in the
pyspark output) collecting these files from the various partitions. Only
after this does the query start running.

Is there a way to store the object that has collected all these partitions
and files, so that every time I restart the job I can load that object
instead of spending 50 minutes just collecting the files before the query
runs?


Please let me know if the question is unclear.

Regards,
Gourav Sengupta
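[A conceptual sketch of what the question is asking for, in plain Python rather than any Spark API: cache the result of an expensive "discover partitions and files" step on disk so a restarted job can reload it instead of recomputing. All function and path names here are illustrative.]

```python
import os
import pickle

def discover_files(table_root):
    """Walk the table directory once and collect every data file per partition."""
    metadata = {}
    for dirpath, _dirnames, filenames in os.walk(table_root):
        if filenames:
            metadata[dirpath] = sorted(filenames)
    return metadata

def load_or_discover(table_root, cache_path):
    """Reload cached metadata if present; otherwise discover it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    metadata = discover_files(table_root)
    with open(cache_path, "wb") as f:
        pickle.dump(metadata, f)
    return metadata
```

[One caveat with any cache like this: it goes stale as soon as partitions or files are added, so it would need invalidation on table changes.]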

Re: storing query object

Posted by Ted Yu <yu...@gmail.com>.
In SQLConf.scala, I found this:

  val PARALLEL_PARTITION_DISCOVERY_THRESHOLD = intConf(
    key = "spark.sql.sources.parallelPartitionDiscovery.threshold",
    defaultValue = Some(32),
    doc = "The degree of parallelism for schema merging and partition discovery of " +
      "Parquet data sources.")

But it looks like it may not help your case.

FYI
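[For reference, the threshold quoted above can be set in a job's configuration; as the doc string notes it applies to Parquet-style data sources, so a Hive table over TSV may not benefit. A hypothetical spark-defaults.conf entry, with an example value:]

```properties
# File count above which partition discovery runs as a parallel Spark job
# rather than serially on the driver (64 is only an illustrative value).
spark.sql.sources.parallelPartitionDiscovery.threshold  64
```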


Re: storing query object

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Ted,

I am using Spark 1.5.2 as currently available in AWS EMR 4.x. The data is in
TSV format.

I do not see any effect of the work already done in this area on data stored
in Hive: it takes around 50 minutes just to collect the table metadata over
a 40-node cluster, and the time is much the same for smaller clusters of 20
nodes.

Spending 50 minutes collecting the metadata once is fine, but we should then
be able to store the object (which is in memory after the metadata is read
the first time) so that next time we can simply restore it instead of
reading the metadata all over again. Alternatively, we should be able to
parallelize the metadata collection so that it does not take such a long
time.

Please advise.

Regards,
Gourav
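[The parallel metadata collection this message asks for can be sketched in plain Python with a thread pool fanning out the per-partition directory listings. This only mirrors the idea behind Spark's parallel partition discovery, not its actual implementation; all names are illustrative.]

```python
import os
from concurrent.futures import ThreadPoolExecutor

def list_partition(partition_dir):
    """List the data files of one partition directory."""
    return partition_dir, sorted(
        f for f in os.listdir(partition_dir)
        if os.path.isfile(os.path.join(partition_dir, f))
    )

def discover_in_parallel(table_root, max_workers=8):
    """Fan the per-partition listings out over a thread pool."""
    partitions = [
        os.path.join(table_root, d)
        for d in os.listdir(table_root)
        if os.path.isdir(os.path.join(table_root, d))
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(list_partition, partitions))
```

[Against a remote store such as S3, where each listing is a round trip, this kind of fan-out is what turns hundreds of sequential listings into a handful of concurrent batches.]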





Re: storing query object

Posted by Ted Yu <yu...@gmail.com>.
There have been optimizations in this area, such as:
https://issues.apache.org/jira/browse/SPARK-8125

You can also look at the parent issue.

Which Spark release are you using ?


