Posted to user@hive.apache.org by Sujeet Pardeshi <Su...@sas.com> on 2019/01/10 06:57:46 UTC

Out Of Memory Error

Hi Pals,
I have the Hive SQL below, which is failing with "at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: Java heap space". It is basically running out of memory. The table being queried has 246,608,473 (~246 million) records and is around 43 GB in size. I am running this SQL on a 4-node Hadoop cluster; each node has 16 GB of memory and 128 GB of disk space. I can certainly increase the memory or scale out to more nodes, but is there something I can do to make this query work without touching the cluster or the memory?

create table t1_content_pages_agg_by_month stored as orc
as
select * from (
select A.dt
       ,A.year
       ,A.month
       ,A.bouncer
       ,A.visitor_type
       ,A.device_type
       ,A.pg_domain_name
       ,A.pg_page_url
       ,A.class1_id
       ,A.class2_id
       ,A.total_page_view_time
       ,row_number() over (PARTITION BY A.dt, A.year, A.month, A.bouncer, A.visitor_type, A.device_type) as rank
from content_pages_agg_by_month A
) AA
;


Regards,

Sujeet Singh Pardeshi

Software Specialist

SAS Research and Development (India) Pvt. Ltd.
Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar, Pune, Maharashtra, 411 013
off: +91-20-30418810
 "When the solution is simple, God is answering..."


Re: Out Of Memory Error

Posted by Gopal Vijayaraghavan <go...@apache.org>.
>               ,row_number() over ( PARTITION BY A.dt,A.year, A.month, A.bouncer,A.visitor_type,A.device_type order by A.total_page_view_time desc ) as rank 
from content_pages_agg_by_month A

The row_number() window function is a streaming function, so this should not consume a significant part of memory as part of this operation.

I suspect there's some issue with the build of Hive you are using that is making it hold more rows in memory than it should, but I can't be sure.

While the query is running, take a jstack of one of the TezChild processes; with that you can file a bug with your vendor and possibly get a patch for the problem.

This particular function was improved significantly in Hive 3.0 by vectorizing the implementation natively:

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/ptf/VectorPTFEvaluatorRowNumber.java#L29
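On Hive 3.x, the vectorized PTF path is gated by session properties; a minimal check, assuming the stock property names (not confirmed for the poster's build):

```sql
-- Illustrative: confirm/enable vectorized window-function (PTF) evaluation on Hive 3.x.
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.ptf.enabled=true;
-- Running "set <property>;" without a value echoes the current setting.
```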
 
Cheers,
Gopal 


RE: Out Of Memory Error

Posted by Sujeet Pardeshi <Su...@sas.com>.
Thanks Jörn. Really appreciate your response. Based on your inputs I have increased the default heap size from 1000 MB to 3000 MB; I will let you know the results soon. My Hive execution engine is Tez and the Hive version is 'Hive 2.1.1-amzn-0'. I don't have the liberty to switch back to MR. My actual Hive query is below; I want to find the top 50 performers per partition based on the page view time. Do you have any other inputs that I can play around with?

create table t1_content_pages_agg_by_month stored as orc
as
select * from (
select A.dt
       ,A.year
       ,A.month
       ,A.bouncer
       ,A.visitor_type
       ,A.device_type
       ,A.pg_domain_name
       ,A.pg_page_url
       ,A.class1_id
       ,A.class2_id
       ,A.total_page_view_time
       ,row_number() over (PARTITION BY A.dt, A.year, A.month, A.bouncer, A.visitor_type, A.device_type
                           order by A.total_page_view_time desc) as rank
from content_pages_agg_by_month A
) AA
where rank <= ${TOP_K_VAL};
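On Tez, the heap that matters is the per-container heap rather than the client-side default, and it is normally set per session. A hedged sketch of the usual knobs (values are illustrative, not from this thread):

```sql
-- Illustrative Tez memory settings; exact values depend on the cluster.
set hive.execution.engine=tez;
set hive.tez.container.size=4096;    -- container size in MB
set hive.tez.java.opts=-Xmx3276m;    -- task heap, conventionally ~80% of the container
set tez.runtime.io.sort.mb=1024;     -- shuffle sort buffer
```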



Regards,

Sujeet Singh Pardeshi

Software Specialist

SAS Research and Development (India) Pvt. Ltd.
Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar, Pune, Maharashtra, 411 013
off: +91-20-30418810
 "When the solution is simple, God is answering…"

From: Jörn Franke <jo...@gmail.com>
Sent: 10 January 2019 01:29 PM
To: user@hive.apache.org
Cc: Shashikant Deore <sh...@sas.com>
Subject: Re: Out Of Memory Error




Re: Out Of Memory Error

Posted by Jörn Franke <jo...@gmail.com>.
Which Hive version and engine?

If it is Tez, then you can also try MR as the engine (set hive.execution.engine=mr); that will use less memory. Also check the max heap configuration on the nodes: you may have 16 GB of physical memory, but the Java process may only be allowed 4 GB or so.
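The engine switch and per-task heap mentioned above are plain session settings; a hedged sketch, with illustrative values:

```sql
-- Illustrative MR engine settings; values depend on the cluster.
set hive.execution.engine=mr;
set mapreduce.map.memory.mb=4096;        -- map container size (MB)
set mapreduce.map.java.opts=-Xmx3276m;   -- map heap, ~80% of container
set mapreduce.reduce.memory.mb=8192;     -- reducers do the window-function work
set mapreduce.reduce.java.opts=-Xmx6554m;
```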

Maybe your query could also be expressed differently, but I am missing background information on the use case.

Last but not least: size on disk != size in memory, especially once you go beyond simple queries.

Also try with a smaller subset of the data to see at what point you reach the memory limit.
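Running the same statement over a sample is one way to do that. A hedged sketch using Hive block sampling (table and column names are from the thread; the sample size and target table name are illustrative, and block-sampling granularity depends on the input format):

```sql
-- Smoke-test the window function on roughly 1 percent of the input.
create table t1_content_pages_agg_by_month_sample stored as orc
as
select * from (
select A.*
       ,row_number() over (PARTITION BY A.dt, A.year, A.month, A.bouncer, A.visitor_type, A.device_type
                           order by A.total_page_view_time desc) as rank
from content_pages_agg_by_month tablesample (1 percent) A
) AA
where rank <= 50;
```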

> On 10.01.2019, at 07:57, Sujeet Pardeshi <Su...@sas.com> wrote: