Posted to user@hive.apache.org by sam peng <62...@qq.com> on 2019/03/14 04:34:17 UTC

Hive execute big data in single node with small memory

Hello friends.

I have a problem handling big data (about 72G) with small memory (32G of RAM).

I need to select distinct rows from 7 days of data (about 70G) and inner join them with another small table. It always returns an error.


To test whether Hive can handle the big data at all, I ran select count(*) on the table, and it returned an error directly.



The NodeManager has 26G, and each map task gets 2G.
To control the number of map tasks, I set:
set mapreduce.input.fileinputformat.split.minsize=10000000;
set mapred.map.tasks=10;
Neither setting has any effect.

Yarn-site.xml
yarn.nodemanager.resource.memory-mb 26840
yarn.scheduler.maximum-allocation-mb 2096
yarn.scheduler.minimum-allocation-mb 512
yarn.nodemanager.vmem-pmem-ratio 1.1

Mapred-site.xml
mapreduce.reduce.java.opts -Xmx2048
mapreduce.map.java.opts  -Xmx1024
mapreduce.reduce.memory.mb 2560
mapreduce.map.memory.mb 1536

I wonder:
1. Can I handle 72G of data with 32G of memory on a single node? If so, how should it be configured?
2. Why does setting the number of map tasks have no effect?

Thanks in advance.

Re: Hive execute big data in single node with small memory

Posted by Furcy Pin <pi...@gmail.com>.
Hi, Hive and MapReduce are tools made to handle large volumes of data
(terabytes) with many executors, each (generally) using a small amount of
RAM (a few gigabytes per executor).
This means that the most common operation consists of streaming data from
disk into the JVM and then back to disk, which is why it can scale very
far. Of course there are several optimizations that make use of RAM, but it
looks like you have plenty for what you intend to do.

We can't help you fix your problem if you don't provide more details.
How many nodes do you have on your cluster? How much CPU and RAM per node?
In general the default configuration should work fine for simple use cases.

If you still have problems, please send the complete query that you are
running, a description of your input tables (size, format, etc.) and the
result of the EXPLAIN command applied to your query.
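
For example, a minimal sketch (the table and column names below are
hypothetical stand-ins for your actual query):

-- Prefixing the query with EXPLAIN prints the execution plan
-- instead of running the job.
EXPLAIN
SELECT DISTINCT e.*
FROM events_7days e
JOIN small_table s
  ON e.user_id = s.user_id;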

As a wild guess, however: since you said you were performing a join, you
should start by checking the distribution of your join keys in each table.
For instance, if one key value occurs 1M times in the large table and 1M
times in the small table too, the join produces 1M x 1M = 10^12 (a
trillion) rows for that key, all of which hit a single reducer. It's a
very common mistake for beginners, and no Big Data tool can handle that.
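
A minimal sketch of such a check in HiveQL (the table name events_7days
and the join column user_id are hypothetical; substitute your own):

-- Find the join-key values with the most rows in the large table.
-- Run the same query against the small table and compare the top keys.
SELECT user_id, COUNT(*) AS cnt
FROM events_7days
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 20;

If the same few keys dominate both tables, the join output for those keys
grows multiplicatively, which matches the failure mode described above.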

Hope it helps,

Furcy

