Posted to user@spark.apache.org by Gourav Sengupta <go...@gmail.com> on 2016/01/19 12:24:44 UTC
storing query object
Hi,
I have a Spark table (created from hiveContext) with a couple of hundred
partitions and a few thousand files.
When I run a query on the table, Spark spends a lot of time (as seen in
the pyspark output) collecting these files from the several partitions.
Only after this does the query actually start running.
Is there a way to store the object that has collected all these partitions
and files, so that every time I restart the job I can load that object
instead of spending 50 minutes just collecting the files before the query
starts?
Please do let me know in case the question is not quite clear.
Regards,
Gourav Sengupta
Re: storing query object
Posted by Ted Yu <yu...@gmail.com>.
In SQLConf.scala, I found this:

val PARALLEL_PARTITION_DISCOVERY_THRESHOLD = intConf(
  key = "spark.sql.sources.parallelPartitionDiscovery.threshold",
  defaultValue = Some(32),
  doc = "The degree of parallelism for schema merging and partition discovery of " +
    "Parquet data sources.")

But it looks like it may not help your case, since the doc mentions Parquet
data sources and your data is in TSV format.
FYI
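For illustration only, here is a minimal, self-contained sketch (plain Python with the standard library, not Spark itself; `discover_partitions`, the thread-pool size, and the threshold constant are made-up stand-ins) of the kind of threshold-gated behavior that config governs: below the threshold the driver lists partition directories serially, and above it the listing is fanned out to a pool of workers.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical analogue of spark.sql.sources.parallelPartitionDiscovery.threshold
PARALLEL_DISCOVERY_THRESHOLD = 32

def list_partition(path):
    """List the data files under a single partition directory."""
    return [os.path.join(path, f) for f in sorted(os.listdir(path))]

def discover_partitions(table_root):
    """Collect every file under every partition of a table directory.

    Serial listing below the threshold, thread-pool listing above it --
    a toy model of the parallel discovery that SPARK-8125 introduced.
    """
    partitions = sorted(
        os.path.join(table_root, d)
        for d in os.listdir(table_root)
        if os.path.isdir(os.path.join(table_root, d))
    )
    if len(partitions) <= PARALLEL_DISCOVERY_THRESHOLD:
        listings = [list_partition(p) for p in partitions]
    else:
        with ThreadPoolExecutor(max_workers=8) as pool:
            listings = list(pool.map(list_partition, partitions))
    return [f for files in listings for f in files]
```

With a few hundred partitions, the parallel branch is the one that would fire; the point of the threshold is to avoid paying thread-pool overhead for small tables.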
On Fri, Jan 22, 2016 at 3:09 AM, Gourav Sengupta <go...@gmail.com>
wrote:
Re: storing query object
Posted by Gourav Sengupta <go...@gmail.com>.
Hi Ted,
I am using Spark 1.5.2, as currently available in AWS EMR 4.x. The data is
in TSV format.
I do not see any effect of the work already done in this area for data
stored in Hive: it takes around 50 minutes just to collect the table
metadata on a 40-node cluster, and the time is much the same on smaller
clusters of 20 nodes.
Spending 50 minutes to collect the metadata once is fine, but we should
then be able to store the object (which is in memory after the metadata is
read the first time) so that next time we can simply restore the object
instead of reading the metadata all over again. Alternatively, we should be
able to parallelize the metadata collection so that it does not take such a
long time.
Please advise.
Regards,
Gourav
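The caching Gourav is asking for can be sketched outside Spark itself. Below is a minimal, hypothetical illustration (plain Python with the standard library; `scan_table_files`, `load_or_scan`, and the cache path are invented names, and Spark 1.5.2 exposes no such public hook) of persisting a discovered file listing to JSON so that a restarted job reloads it instead of re-listing everything:

```python
import json
import os

def scan_table_files(table_root):
    """Walk the table directory and record (path, size) for every file.

    This stands in for the slow 50-minute metadata collection step.
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(table_root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            entries.append({"path": path, "size": os.path.getsize(path)})
    return entries

def load_or_scan(table_root, cache_path):
    """Reload the file listing from a JSON cache if present, else scan.

    Deleting the cache file forces a fresh scan, e.g. after new
    partitions have been added.
    """
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            return json.load(fh)
    entries = scan_table_files(table_root)
    with open(cache_path, "w") as fh:
        json.dump(entries, fh)
    return entries
```

The obvious caveat is staleness: a cache like this must be explicitly invalidated whenever partitions are added or files change, which is exactly the trade-off any metadata cache has to manage.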
On Fri, Jan 22, 2016 at 10:15 AM, Ted Yu <yu...@gmail.com> wrote:
Re: storing query object
Posted by Ted Yu <yu...@gmail.com>.
There have been optimizations in this area, such as:
https://issues.apache.org/jira/browse/SPARK-8125
You can also look at the parent issue.
Which Spark release are you using?
> On Jan 22, 2016, at 1:08 AM, Gourav Sengupta <go...@gmail.com> wrote:
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org