Posted to user@spark.apache.org by unk1102 <um...@gmail.com> on 2015/10/03 12:19:46 UTC

How to optimize group by query fired using hiveContext.sql?

Hi, I have a couple of Spark jobs that run a group by query fired from
hiveContext.sql(). I know group by is evil, but in my use case I can't
avoid it: there are around 7-8 fields I need to group on. I am also using
df1.except(df2), which also seems to be a heavy operation and does a lot
of shuffling; please see my UI snapshot
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n24914/IMG_20151003_151830218.jpg>

I have tried almost every optimisation, including moving to Spark 1.5, but
nothing seems to work: my job hangs or fails because an executor reaches
its physical memory limit and YARN kills it. I have around 1 TB of data to
process and it is skewed. A rough sketch of what each job does is below.
Please guide.
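
Roughly, each job looks like this (a much simplified sketch; table and
field names are made up, and sc is the already created SparkContext):

import org.apache.spark.sql.hive.HiveContext

// Much simplified sketch of one job -- real table/field names differ.
val hiveContext = new HiveContext(sc)

// group by on 7-8 fields, fired as SQL
val df1 = hiveContext.sql("""
  SELECT f1, f2, f3, f4, f5, f6, f7, f8, SUM(amount) AS total
  FROM current_snapshot
  GROUP BY f1, f2, f3, f4, f5, f6, f7, f8""")

// a second DataFrame with the same shape (e.g. the previous snapshot)
val df2 = hiveContext.sql("""
  SELECT f1, f2, f3, f4, f5, f6, f7, f8, SUM(amount) AS total
  FROM previous_snapshot
  GROUP BY f1, f2, f3, f4, f5, f6, f7, f8""")

// rows in df1 that are not in df2 -- this except() also shuffles heavily
val changed = df1.except(df2)
changed.count()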



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-optimize-group-by-query-fired-using-hiveContext-sql-tp24914.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: How to optimize group by query fired using hiveContext.sql?

Posted by Umesh Kacha <um...@gmail.com>.
Hi, thanks. I usually see the following errors in the Spark logs, and
because of them I think the executors get lost. All of this happens
because of the huge data shuffle, which I can't avoid. I don't know what
to do, please guide.

15/08/16 12:26:46 WARN spark.HeartbeatReceiver: Removing executor 10
with no recent heartbeats: 1051638 ms exceeds timeout 1000000 ms

Or

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
output location for shuffle 0
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380)
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)



Or YARN kills the container because of:

Container [pid=26783,containerID=container_1389136889967_0009_01_000002]
is running beyond physical memory limits. Current usage: 30.2 GB of 30
GB physical memory used; Killing container.
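
The only thing I can think of trying next is to bump the timeouts and
shuffle retries, something like this (just a sketch, I am not sure these
values are right for my cluster):

./spark-submit --class com.xyz.abc.MySparkJob \
  --master yarn-client --driver-memory 4g \
  --executor-memory 27G --executor-cores 2 --num-executors 40 \
  --conf spark.network.timeout=600s \
  --conf spark.executor.heartbeatInterval=60s \
  --conf spark.shuffle.io.maxRetries=10 \
  --conf spark.shuffle.io.retryWait=30s \
  /path/to/spark-job.jar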



Re: How to optimize group by query fired using hiveContext.sql?

Posted by Alex Rovner <al...@magnetic.com>.
Can you at least copy paste the error(s) you are seeing when the job fails?
Without the error message(s), it's hard to even suggest anything.

*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052

* <http://www.magnetic.com/>*


Re: How to optimize group by query fired using hiveContext.sql?

Posted by Umesh Kacha <um...@gmail.com>.
Hi, thanks. I can't share the YARN logs because of privacy rules at my
company, but I can tell you I have gone through them and found nothing
except YARN killing the container because it exceeds the physical memory
limit.

I am submitting with the command line below. The driver launches around
1500 ExecutorService tasks through a thread pool of 15, so at any time 15
jobs are running, as shown in the UI.

./spark-submit --class com.xyz.abc.MySparkJob \
  --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" \
  --driver-java-options -XX:MaxPermSize=512m \
  --driver-memory 4g --master yarn-client \
  --executor-memory 27G --executor-cores 2 \
  --num-executors 40 \
  --jars /path/to/others-jars \
  /path/to/spark-job.jar
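
The driver-side fan-out is roughly this (a much simplified sketch; the
query, table and field names are placeholders for the real ones):

import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Simplified sketch of the driver: a fixed pool of 15 threads, so at most
// 15 Spark jobs run at the same time; about 1500 tasks are submitted overall.
val sc = new SparkContext(new SparkConf().setAppName("MySparkJob"))
val hiveContext = new HiveContext(sc)

val pool = Executors.newFixedThreadPool(15)
(1 to 1500).foreach { i =>
  pool.submit(new Runnable {
    override def run(): Unit = {
      // each task fires one group-by query through hiveContext.sql()
      hiveContext.sql(
        s"SELECT f1, f2, SUM(amount) FROM my_table WHERE part_id = $i GROUP BY f1, f2"
      ).count()
    }
  })
}
pool.shutdown()
pool.awaitTermination(24, TimeUnit.HOURS)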



Re: How to optimize group by query fired using hiveContext.sql?

Posted by Alex Rovner <al...@magnetic.com>.
Can you send over your yarn logs along with the command you are using to
submit your job?

*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052

* <http://www.magnetic.com/>*


Re: How to optimize group by query fired using hiveContext.sql?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Alex thanks much for the reply. Please read the following for more
details about my problem.

http://stackoverflow.com/questions/32317285/spark-executor-oom-issue-on-yarn

Each of my containers has 8 cores and 30 GB of memory at most, so I am
running in yarn-client mode with 40 executors of 27 GB / 2 cores each. If
I use more cores, my job starts losing even more executors. I tried
setting spark.yarn.executor.memoryOverhead to around 2 GB and even 8 GB,
but it does not help; I lose executors no matter what. The reason is that
my jobs shuffle a lot of data, as much as 20 GB per job according to the
UI. The shuffle comes from the group by, and I can't avoid it in my case.
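
The only other idea I have is to raise the number of shuffle partitions
and split the skewed group by into two stages with a random salt, roughly
like this (just a sketch with placeholder names; it assumes the aggregate
is decomposable, e.g. SUM or COUNT):

// spread the group by over more, smaller reduce tasks
hiveContext.setConf("spark.sql.shuffle.partitions", "2000")

// stage 1: partial aggregate on (keys + random salt), so a single hot key
// is split across many tasks instead of landing on one executor
hiveContext.sql("""
  SELECT f1, f2, f3, salt, SUM(amount) AS partial_sum
  FROM (SELECT f1, f2, f3, amount,
               CAST(rand() * 32 AS INT) AS salt
        FROM my_table) t
  GROUP BY f1, f2, f3, salt""").registerTempTable("partials")

// stage 2: final aggregate over the much smaller partial results
val result = hiveContext.sql(
  "SELECT f1, f2, f3, SUM(partial_sum) AS total FROM partials GROUP BY f1, f2, f3")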




Re: How to optimize group by query fired using hiveContext.sql?

Posted by Alex Rovner <al...@magnetic.com>.
This sounds like you need to increase YARN overhead settings with the
"spark.yarn.executor.memoryOverhead"
parameter. See http://spark.apache.org/docs/latest/running-on-yarn.html for
more information on the setting.
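
For example, something along these lines (the overhead value is just a
guess and depends on your executor size; note that executor memory plus
overhead has to fit within the YARN container limit):

./spark-submit --class com.example.YourJob --master yarn-client \
  --executor-memory 24G --executor-cores 2 --num-executors 40 \
  --conf spark.yarn.executor.memoryOverhead=3072 \
  /path/to/your-job.jar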

If that does not work for you, please provide the error messages and the
command line you are using to submit your jobs for further troubleshooting.


*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052

* <http://www.magnetic.com/>*
