Posted to user@spark.apache.org by Zhiliang Zhu <zc...@yahoo.com.INVALID> on 2016/07/18 10:04:34 UTC

the spark job is so slow - almost frozen

Hi All,  
Here we have one application that needs to extract different columns from 6 Hive tables and then do some simple calculation; there are around 100,000 rows in each table. Finally it needs to output another table or file (with a consistent set of columns).
However, after many days of trying, the Spark Hive job is unthinkably slow, sometimes almost frozen. The Spark cluster has 5 nodes. Could anyone offer some help? Any idea or clue would also be welcome.
Thanks in advance~
Zhiliang 

Re: the spark job is so slow - almost frozen

Posted by Gourav Sengupta <go...@gmail.com>.
Andrew,

you have pretty much consolidated my entire experience. Please give a
presentation on this at a meetup, and send across the links :)


Regards,
Gourav

On Wed, Jul 20, 2016 at 4:35 AM, Andrew Ehrlich <an...@aehrlich.com> wrote:

> Try:
>
> - filtering down the data as soon as possible in the job, dropping columns
> you don’t need
> - processing fewer partitions of the hive tables at a time
> - caching frequently accessed data, for example dimension tables, lookup
> tables, or other datasets that are repeatedly accessed
> - using the Spark UI to identify the bottlenecked resource
> - removing features or columns from the output data until it runs, then
> adding them back in one at a time
> - creating a static dataset small enough to work, and editing the query,
> then retesting, repeatedly until you cut the execution time by a
> significant fraction
> - using the Spark UI or spark shell to check the skew and make sure
> partitions are evenly distributed
>
> On Jul 18, 2016, at 3:33 AM, Zhiliang Zhu <zchl.jump@yahoo.com.INVALID> wrote:
>
> Thanks a lot for your reply .
>
> In effect, here we tried to run the SQL with Kettle, Hive, and Spark Hive
> (via HiveContext) respectively; the job seems frozen and never finishes.
>
> Across the 6 tables, we need to read different columns from different
> tables for specific information, then do some simple calculation before
> output. Join operations are used most in the SQL.
>
> Best wishes!
>
>
>
>
> On Monday, July 18, 2016 6:24 PM, Chanh Le <gi...@gmail.com> wrote:
>
>
> Hi,
> What about the network (bandwidth) between Hive and Spark?
> Did it run in Hive before you moved to Spark?
> Because it's complex, you can use something like the EXPLAIN command to
> show what is going on.
>
> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu <zchl.jump@yahoo.com.INVALID> wrote:
>
> The SQL logic in the program is very complex, so I will not describe the
> detailed code here.
>
>
> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <zchl.jump@yahoo.com.INVALID> wrote:
>
>
> Hi All,
>
> Here we have one application that needs to extract different columns from
> 6 Hive tables and then do some simple calculation; there are around
> 100,000 rows in each table. Finally it needs to output another table or
> file (with a consistent set of columns).
>
> However, after many days of trying, the Spark Hive job is unthinkably
> slow, sometimes almost frozen. The Spark cluster has 5 nodes.
>
> Could anyone offer some help? Any idea or clue would also be welcome.
>
> Thanks in advance~
>
> Zhiliang

Re: the spark job is so slow - almost frozen

Posted by Zhiliang Zhu <zc...@yahoo.com.INVALID>.
Thanks a lot for your kind help.  

    On Wednesday, July 20, 2016 11:35 AM, Andrew Ehrlich <an...@aehrlich.com> wrote:
 

 Try:
- filtering down the data as soon as possible in the job, dropping columns you don’t need
- processing fewer partitions of the hive tables at a time
- caching frequently accessed data, for example dimension tables, lookup tables, or other datasets that are repeatedly accessed
- using the Spark UI to identify the bottlenecked resource
- removing features or columns from the output data until it runs, then adding them back in one at a time
- creating a static dataset small enough to work, and editing the query, then retesting, repeatedly until you cut the execution time by a significant fraction
- using the Spark UI or spark shell to check the skew and make sure partitions are evenly distributed

On Jul 18, 2016, at 3:33 AM, Zhiliang Zhu <zc...@yahoo.com.INVALID> wrote:
Thanks a lot for your reply .
In effect, here we tried to run the SQL with Kettle, Hive, and Spark Hive (via HiveContext) respectively; the job seems frozen and never finishes.
Across the 6 tables, we need to read different columns from different tables for specific information, then do some simple calculation before output. Join operations are used most in the SQL.
Best wishes! 

 

    On Monday, July 18, 2016 6:24 PM, Chanh Le <gi...@gmail.com> wrote:
 

Hi,
What about the network (bandwidth) between Hive and Spark?
Did it run in Hive before you moved to Spark?
Because it's complex, you can use something like the EXPLAIN command to show what is going on.



 
On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu <zc...@yahoo.com.INVALID> wrote:
The SQL logic in the program is very complex, so I will not describe the detailed code here.

    On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <zc...@yahoo.com.INVALID> wrote:
 

Hi All,
Here we have one application that needs to extract different columns from 6 Hive tables and then do some simple calculation; there are around 100,000 rows in each table. Finally it needs to output another table or file (with a consistent set of columns).
However, after many days of trying, the Spark Hive job is unthinkably slow, sometimes almost frozen. The Spark cluster has 5 nodes. Could anyone offer some help? Any idea or clue would also be welcome.
Thanks in advance~
Zhiliang


Re: the spark job is so slow - almost frozen

Posted by Andrew Ehrlich <an...@aehrlich.com>.
Try:

- filtering down the data as soon as possible in the job, dropping columns you don’t need
- processing fewer partitions of the hive tables at a time
- caching frequently accessed data, for example dimension tables, lookup tables, or other datasets that are repeatedly accessed
- using the Spark UI to identify the bottlenecked resource
- removing features or columns from the output data until it runs, then adding them back in one at a time
- creating a static dataset small enough to work, and editing the query, then retesting, repeatedly until you cut the execution time by a significant fraction
- using the Spark UI or spark shell to check the skew and make sure partitions are evenly distributed
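The skew check in the last point can also be done programmatically. A minimal pure-Python sketch (the partition sizes below are hypothetical; in practice they would come from the Spark UI's per-task shuffle read column, or from `df.rdd.glom().map(len).collect()` on small data):

```python
# Rough skew check: compare the largest partition to the mean partition size.
# The sizes list is made up for illustration; in Spark you could obtain
# per-partition record counts with df.rdd.glom().map(len).collect().

def skew_ratio(partition_sizes):
    """Return max/mean ratio; values far above 1 suggest heavy skew."""
    nonzero = [s for s in partition_sizes if s > 0]
    if not nonzero:
        return 0.0
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean

# 200 partitions where a single one holds almost everything
# (the shape of a "198/200 tasks done, then frozen" stage):
sizes = [0] * 199 + [1_000_000]
print(skew_ratio(sizes))  # 200.0 -> one partition carries all the work
```

A ratio near 1 means evenly distributed partitions; a ratio in the tens or hundreds means one straggler task is doing nearly all the work.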

> On Jul 18, 2016, at 3:33 AM, Zhiliang Zhu <zc...@yahoo.com.INVALID> wrote:
> 
> Thanks a lot for your reply .
> 
> In effect, here we tried to run the SQL with Kettle, Hive, and Spark Hive (via HiveContext) respectively; the job seems frozen and never finishes.
>
> Across the 6 tables, we need to read different columns from different tables for specific information, then do some simple calculation before output.
> Join operations are used most in the SQL.
> 
> Best wishes! 
> 
> On Monday, July 18, 2016 6:24 PM, Chanh Le <gi...@gmail.com> wrote:
> 
> 
> Hi,
> What about the network (bandwidth) between Hive and Spark?
> Did it run in Hive before you moved to Spark?
> Because it's complex, you can use something like the EXPLAIN command to show what is going on.
>
>> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu <zchl.jump@yahoo.com.INVALID> wrote:
>> 
>> The SQL logic in the program is very complex, so I will not describe the detailed code here.
>> 
>> 
>> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <zchl.jump@yahoo.com.INVALID> wrote:
>> 
>> 
>> Hi All,
>>
>> Here we have one application that needs to extract different columns from 6 Hive tables and then do some simple calculation; there are around 100,000 rows in each table.
>> Finally it needs to output another table or file (with a consistent set of columns).
>>
>> However, after many days of trying, the Spark Hive job is unthinkably slow, sometimes almost frozen. The Spark cluster has 5 nodes.
>>
>> Could anyone offer some help? Any idea or clue would also be welcome.
>>
>> Thanks in advance~
>>
>> Zhiliang


the spark job is so slow during shuffle - almost frozen

Posted by Zhiliang Zhu <zc...@yahoo.com.INVALID>.


Hi All,
Referring to the Spark UI: during the shuffle stage of one job it displays 198/200 tasks and is almost frozen. Most executors show 0 bytes of shuffle data, but one executor shows 1 GB.
Moreover, in several of the join operations, one table or pair RDD has only 40 keys while the other table has about 10,000 keys.
Could this be a data skew issue?
Any help or comment will be deeply appreciated.
Thanks in advance ~
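A join where one side has only a few dozen distinct keys is a classic skew setup. One standard workaround (not from this thread, just a common technique) is key salting: append a random salt to the large side's keys and replicate the small side once per salt value, so each hot key spreads across several partitions. A plain-Python sketch of the idea:

```python
import random

SALTS = 8  # number of buckets to spread each hot key over (tunable)

def salt_large_side(rows):
    """rows: (key, value) pairs. Append a random salt so identical keys scatter."""
    return [((key, random.randrange(SALTS)), value) for key, value in rows]

def explode_small_side(rows):
    """Replicate each small-side row once per salt so every salted key still matches."""
    return [((key, s), value) for key, value in rows for s in range(SALTS)]

# One hot key on the large side now maps to at most SALTS join buckets:
large = [("hot", i) for i in range(1000)]
small = [("hot", "dim")]
salted = salt_large_side(large)
exploded = explode_small_side(small)
buckets = {k for k, _ in salted}
print(len(buckets) <= SALTS, len(exploded))  # True 8
```

Joining on the composite `(key, salt)` instead of `key` trades a SALTS-fold blow-up of the small side for an even spread of the hot key's rows; in Spark the same mapping would be applied with `map`/`withColumn` before the join.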

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Here we have one application that needs to extract different columns from 6 Hive tables and then do some simple calculation; there are around 100,000 rows in each table. Finally it needs to output another table or file (with a consistent set of columns).

 However, after many days of trying, the Spark Hive job is unthinkably slow, sometimes almost frozen. The Spark cluster has 5 nodes.

 Could anyone offer some help? Any idea or clue would also be welcome.

 Thanks in advance~



On Tuesday, July 19, 2016 11:05 AM, Zhiliang Zhu <zc...@yahoo.com> wrote:

 Hi Mungeol,
Thanks a lot for your help. I will try that. 

    On Tuesday, July 19, 2016 9:21 AM, Mungeol Heo <mu...@gmail.com> wrote:
 

Try to run an action, like save or insertInto, at an intermediate stage of your job process.
Hope it can help you out.
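This suggestion of materializing an intermediate result can be sketched roughly as below. This is a hedged illustration only: the table names are placeholders, and on Spark 1.x (current when this thread was written) the entry point would be HiveContext rather than SparkSession:

```python
# Hypothetical sketch (not from the thread): force an intermediate action so a
# long chain of joins is materialized once instead of recomputed lazily.
from pyspark.sql import SparkSession  # on Spark 1.x: HiveContext(sc)

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Placeholder tables and join key:
joined = spark.table("t1").join(spark.table("t2"), "id")

# saveAsTable is an action, so this step executes and persists the result now.
joined.write.mode("overwrite").saveAsTable("tmp_intermediate")

# Continue the pipeline from the materialized table rather than the raw joins.
result = spark.table("tmp_intermediate").join(spark.table("t3"), "id")
```

Materializing midway also makes it much easier to tell which join in a 6-table query is the slow one, since each stage can be timed in isolation.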

On Mon, Jul 18, 2016 at 7:33 PM, Zhiliang Zhu
<zc...@yahoo.com.invalid> wrote:
> Thanks a lot for your reply.
>
> In effect, here we tried to run the SQL with Kettle, Hive, and Spark Hive
> (via HiveContext) respectively; the job seems frozen and never finishes.
>
> Across the 6 tables, we need to read different columns from different
> tables for specific information, then do some simple calculation before
> output. Join operations are used most in the SQL.
>
> Best wishes!
>
> On Monday, July 18, 2016 6:24 PM, Chanh Le <gi...@gmail.com> wrote:
>
>
> Hi,
> What about the network (bandwidth) between Hive and Spark?
> Did it run in Hive before you moved to Spark?
> Because it's complex, you can use something like the EXPLAIN command to
> show what is going on.
>
> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu <zc...@yahoo.com.INVALID>
> wrote:
>
> The SQL logic in the program is very complex, so I will not describe the
> detailed code here.
>
>
> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <zc...@yahoo.com.INVALID>
> wrote:
>
>
> Hi All,
>
> Here we have one application that needs to extract different columns from
> 6 Hive tables and then do some simple calculation; there are around
> 100,000 rows in each table. Finally it needs to output another table or
> file (with a consistent set of columns).
>
> However, after many days of trying, the Spark Hive job is unthinkably
> slow, sometimes almost frozen. The Spark cluster has 5 nodes.
>
> Could anyone offer some help? Any idea or clue would also be welcome.
>
> Thanks in advance~
>
> Zhiliang
>


Re: the spark job is so slow - almost frozen

Posted by Zhiliang Zhu <zc...@yahoo.com.INVALID>.
Thanks a lot for your reply.
In effect, here we tried to run the SQL with Kettle, Hive, and Spark Hive (via HiveContext) respectively; the job seems frozen and never finishes.
Across the 6 tables, we need to read different columns from different tables for specific information, then do some simple calculation before output. Join operations are used most in the SQL.
Best wishes! 

 

    On Monday, July 18, 2016 6:24 PM, Chanh Le <gi...@gmail.com> wrote:
 

Hi,
What about the network (bandwidth) between Hive and Spark?
Did it run in Hive before you moved to Spark?
Because it's complex, you can use something like the EXPLAIN command to show what is going on.



 
On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu <zc...@yahoo.com.INVALID> wrote:
The SQL logic in the program is very complex, so I will not describe the detailed code here.

    On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <zc...@yahoo.com.INVALID> wrote:
 

Hi All,
Here we have one application that needs to extract different columns from 6 Hive tables and then do some simple calculation; there are around 100,000 rows in each table. Finally it needs to output another table or file (with a consistent set of columns).
However, after many days of trying, the Spark Hive job is unthinkably slow, sometimes almost frozen. The Spark cluster has 5 nodes. Could anyone offer some help? Any idea or clue would also be welcome.
Thanks in advance~
Zhiliang


Re: the spark job is so slow - almost frozen

Posted by Chanh Le <gi...@gmail.com>.
Hi,
What about the network (bandwidth) between Hive and Spark?
Did it run in Hive before you moved to Spark?
Because it's complex, you can use something like the EXPLAIN command to show what is going on.
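For what it's worth, here is a hedged sketch of invoking EXPLAIN from a Spark job. The table and column names are made up, and the SparkSession entry point is Spark 2.x; on the HiveContext-era 1.x releases the equivalent would be `HiveContext(sc).sql(...)`:

```python
# Hypothetical sketch (not from the thread): print the query plan before
# running a slow multi-table join. The tables t1..t3 and the join columns
# are placeholders standing in for the real 6-table query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

query = """
SELECT t1.id, t2.col_a, t3.col_b
FROM t1
JOIN t2 ON t1.id = t2.id
JOIN t3 ON t1.id = t3.id
"""

# EXPLAIN EXTENDED shows the parsed, analyzed, optimized, and physical plans,
# which is where shuffle-heavy joins and missing filter pushdowns become visible.
spark.sql("EXPLAIN EXTENDED " + query).show(truncate=False)

# Equivalently, for a DataFrame: spark.sql(query).explain(True)
```

Things to look for in the output include the join strategy chosen for each pair of tables (a broadcast join for the small dimension tables is usually much cheaper than a shuffle join) and whether filters are applied before or after the joins.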

> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu <zc...@yahoo.com.INVALID> wrote:
> 
> The SQL logic in the program is very complex, so I will not describe the detailed code here.
> 
> 
> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <zc...@yahoo.com.INVALID> wrote:
> 
> 
> Hi All,
>
> Here we have one application that needs to extract different columns from 6 Hive tables and then do some simple calculation; there are around 100,000 rows in each table.
> Finally it needs to output another table or file (with a consistent set of columns).
>
> However, after many days of trying, the Spark Hive job is unthinkably slow, sometimes almost frozen. The Spark cluster has 5 nodes.
>
> Could anyone offer some help? Any idea or clue would also be welcome.
> 
> Thanks in advance~
> 
> Zhiliang 


Re: the spark job is so slow - almost frozen

Posted by Zhiliang Zhu <zc...@yahoo.com.INVALID>.
The SQL logic in the program is very complex, so I will not describe the detailed code here.

    On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <zc...@yahoo.com.INVALID> wrote:
 

Hi All,
Here we have one application that needs to extract different columns from 6 Hive tables and then do some simple calculation; there are around 100,000 rows in each table. Finally it needs to output another table or file (with a consistent set of columns).
However, after many days of trying, the Spark Hive job is unthinkably slow, sometimes almost frozen. The Spark cluster has 5 nodes. Could anyone offer some help? Any idea or clue would also be welcome.
Thanks in advance~
Zhiliang