Posted to user@hive.apache.org by imen Megdiche <im...@gmail.com> on 2012/12/13 11:31:38 UTC

Increasing map reduce tasks will increase the CPU time - does this seem correct?

Hello,

I am trying to increase the number of map and reduce tasks for a job, and
even for the same data size, I noticed that the total CPU time increases,
whereas I expected it to decrease. MapReduce is known for its computational
performance, but I do not see this in these small tests.

What do you think about this issue?

Re: Increasing map reduce tasks will increase the CPU time - does this seem correct?

Posted by imen Megdiche <im...@gmail.com>.
OK, but if I were looking for a solution for warehousing big data, then
Hive would actually be the best choice. I know that Facebook uses Hive.


2012/12/13 Mohammad Tariq <do...@gmail.com>

> I said that because under the hood each query (Hive or Pig) first gets
> converted into a MapReduce job, which then gives you the result.
>
> Regards,
>     Mohammad Tariq
>
>
>
> On Thu, Dec 13, 2012 at 7:51 PM, imen Megdiche <im...@gmail.com> wrote:
>
>> I don't understand what you mean by "Same holds good for Hive or Pig".
>> Do you mean I should rather compare data warehouses with Hive or Pig?
>> Great, you help me so much, Mohammad.
>>
>>
>> 2012/12/13 Mohammad Tariq <do...@gmail.com>
>>
>>> If you are going to do some OLTP kinda thing, I would not suggest
>>> Hadoop. Same holds good for Hive or Pig.
>>
>>
>

Re: Increasing map reduce tasks will increase the CPU time - does this seem correct?

Posted by Mohammad Tariq <do...@gmail.com>.
I said that because under the hood each query (Hive or Pig) first gets
converted into a MapReduce job, which then gives you the result.
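
For illustration, here is a minimal sketch of one way to watch that
conversion happen: ask Hive to EXPLAIN the query over JDBC and print the
stage plan it compiles. This assumes a HiveServer listening on
localhost:10000 and the Hive JDBC driver of that era on the classpath;
the table name is just the one from your earlier test.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ShowHivePlan {
    public static void main(String[] args) throws Exception {
        // HiveServer1-era JDBC driver and URL.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // EXPLAIN prints the plan Hive compiled; for an aggregation like
        // this you should see a map-reduce stage, not a plain fetch.
        ResultSet rs = stmt.executeQuery(
                "EXPLAIN SELECT sum(col1) FROM table1");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}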

Regards,
    Mohammad Tariq



On Thu, Dec 13, 2012 at 7:51 PM, imen Megdiche <im...@gmail.com> wrote:

> I don't understand what you mean by "Same holds good for Hive or Pig".
> Do you mean I should rather compare data warehouses with Hive or Pig?
> Great, you help me so much, Mohammad.
>
>
> 2012/12/13 Mohammad Tariq <do...@gmail.com>
>
>> If you are going to do some OLTP kinda thing, I would not suggest Hadoop.
>> Same holds good for Hive or Pig.
>
>

Re: Increasing map reduce tasks will increase the CPU time - does this seem correct?

Posted by imen Megdiche <im...@gmail.com>.
I don't understand what you mean by "Same holds good for Hive or Pig".
Do you mean I should rather compare data warehouses with Hive or Pig?
Great, you help me so much, Mohammad.


2012/12/13 Mohammad Tariq <do...@gmail.com>

> If you are going to do some OLTP kinda thing, I would not suggest Hadoop.
> Same holds good for Hive or Pig.

Re: Increasing map reduce tasks will increase the CPU time - does this seem correct?

Posted by Mohammad Tariq <do...@gmail.com>.
You are welcome.

First things first: we can never compare Hadoop with traditional warehouse
systems or DBMSs. The two are meant for different purposes.

One small example: suppose you have 1 GB of data; then there is nothing
that can match an RDBMS. You'll get the results almost instantly, as you
have observed above. Now suppose your company has done very well, has
grown very big, and you have 500 TB of data. If you try to process this
much data using any traditional system you will face a lot of difficulty,
as these systems have poor horizontal scalability. The only thing you
could do is increase your hardware capacity, which can be done only up to
a certain limit. This is where Hadoop comes into the picture.

You can combine 'N' small machines and utilize their power collectively to
process your huge data: the basic principle of distributed computing. Long
story short, you cannot evaluate the power of Hadoop on a small dataset.
If you are going to do some OLTP kinda thing, I would not suggest Hadoop;
the same holds good for Hive or Pig. Hadoop is basically a batch
processing system and is not meant for real-time stuff.

Now, coming back to your actual question: the number of mappers depends
mainly on the number of InputSplits created by the InputFormat you are
using to process your data, and the number of reducers depends on the
number of partitions created after the map phase.
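
For illustration, a minimal sketch of how those two knobs surface in the
Hadoop 1.x Java API (the paths, split size, and reduce count here are
made-up example values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TaskCountDemo {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "task-count-demo");
        job.setJarByClass(TaskCountDemo.class);

        // Mappers: driven by the InputSplits the InputFormat produces.
        // For FileInputFormat, capping the split size forces more
        // (smaller) splits and therefore more map tasks.
        FileInputFormat.setMaxInputSplitSize(job, 16 * 1024 * 1024); // 16 MB

        // Reducers: exactly what the job asks for; the map output is
        // hash-partitioned into this many partitions.
        job.setNumReduceTasks(4);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With no mapper or reducer class set, the job just passes records through,
but its counters will show how the split size and the reduce count
translate into task counts.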

HTH

Regards,
    Mohammad Tariq



On Thu, Dec 13, 2012 at 6:25 PM, imen Megdiche <im...@gmail.com> wrote:

> Thank you for your explanations. I work in pseudo-distributed mode and
> not on a cluster. Do your recommendations also apply in this mode, and
> how can I get the execution time to decrease as the number of map and
> reduce tasks grows, if that is possible?
> In general, I do not understand how MapReduce can be so much more
> performant at analysis than other systems such as data warehouses. For
> example, I tested with Hive the simple query "select sum(col1) from
> table1": the result with Hive took on the order of 10 min, while Oracle
> took on the order of 0.2 min, for a data size on the order of 40 MB.
>
> Thank you.
>
>
> 2012/12/13 Mohammad Tariq <do...@gmail.com>
>
>> Hello Imen,
>>
>>       If you have a huge number of tasks, then the overhead of managing
>> map and reduce task creation begins to dominate the total job execution
>> time. Also, more tasks means you need more free CPU slots. If the slots
>> are not free, then the data block of interest will be moved to some
>> other node where free slots are available; this consumes time and also
>> goes against the most basic principle of Hadoop, i.e. data locality. So
>> the number of maps and reduces should be raised keeping all these
>> factors in mind, otherwise you may face performance issues.
>>
>> HTH
>>
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>>
>>> On Thu, Dec 13, 2012 at 4:11 PM, Nitin Pawar <ni...@gmail.com> wrote:
>>
>>> If the number of maps or reducers your job launches is more than the
>>> job queue/cluster capacity, CPU time will increase.
>>> On Dec 13, 2012 4:02 PM, "imen Megdiche" <im...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am trying to increase the number of map and reduce tasks for a job,
>>>> and even for the same data size, I noticed that the total CPU time
>>>> increases, whereas I expected it to decrease. MapReduce is known for
>>>> its computational performance, but I do not see this in these small
>>>> tests.
>>>>
>>>> What do you think about this issue?
>>>>
>>>>
>>
>

Re: Increasing map reduce tasks will increase the CPU time - does this seem correct?

Posted by imen Megdiche <im...@gmail.com>.
Thank you for your explanations. I work in pseudo-distributed mode and
not on a cluster. Do your recommendations also apply in this mode, and
how can I get the execution time to decrease as the number of map and
reduce tasks grows, if that is possible?
In general, I do not understand how MapReduce can be so much more
performant at analysis than other systems such as data warehouses. For
example, I tested with Hive the simple query "select sum(col1) from
table1": the result with Hive took on the order of 10 min, while Oracle
took on the order of 0.2 min, for a data size on the order of 40 MB.

Thank you.


2012/12/13 Mohammad Tariq <do...@gmail.com>

> Hello Imen,
>
>       If you have a huge number of tasks, then the overhead of managing
> map and reduce task creation begins to dominate the total job execution
> time. Also, more tasks means you need more free CPU slots. If the slots
> are not free, then the data block of interest will be moved to some
> other node where free slots are available; this consumes time and also
> goes against the most basic principle of Hadoop, i.e. data locality. So
> the number of maps and reduces should be raised keeping all these
> factors in mind, otherwise you may face performance issues.
>
> HTH
>
>
> Regards,
>     Mohammad Tariq
>
>
>
> On Thu, Dec 13, 2012 at 4:11 PM, Nitin Pawar <ni...@gmail.com> wrote:
>
>> If the number of maps or reducers your job launches is more than the
>> job queue/cluster capacity, CPU time will increase.
>> On Dec 13, 2012 4:02 PM, "imen Megdiche" <im...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am trying to increase the number of map and reduce tasks for a job,
>>> and even for the same data size, I noticed that the total CPU time
>>> increases, whereas I expected it to decrease. MapReduce is known for
>>> its computational performance, but I do not see this in these small tests.
>>>
>>> What do you think about this issue?
>>>
>>>
>

Re: Increasing map reduce tasks will increase the CPU time - does this seem correct?

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Imen,

      If you have a huge number of tasks, then the overhead of managing map
and reduce task creation begins to dominate the total job execution time.
Also, more tasks means you need more free CPU slots. If the slots are not
free, then the data block of interest will be moved to some other node
where free slots are available; this consumes time and also goes against
the most basic principle of Hadoop, i.e. data locality. So the number of
maps and reduces should be raised keeping all these factors in mind,
otherwise you may face performance issues.
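
One concrete knob for that task-creation overhead, as a hedged sketch
(this assumes the Hadoop 1.x property mapred.job.reuse.jvm.num.tasks; the
job name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JvmReuseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // By default every task gets a fresh JVM (value 1). Allowing
        // reuse (-1 = unlimited within a job) amortizes JVM startup
        // across the many small tasks whose creation cost would
        // otherwise dominate a short job.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        Job job = new Job(conf, "jvm-reuse-demo");
        // ... set input/output formats and paths, then submit as usual.
    }
}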

HTH


Regards,
    Mohammad Tariq



On Thu, Dec 13, 2012 at 4:11 PM, Nitin Pawar <ni...@gmail.com> wrote:

> If the number of maps or reducers your job launches is more than the
> job queue/cluster capacity, CPU time will increase.
> On Dec 13, 2012 4:02 PM, "imen Megdiche" <im...@gmail.com> wrote:
>
>> Hello,
>>
>> I am trying to increase the number of map and reduce tasks for a job,
>> and even for the same data size, I noticed that the total CPU time
>> increases, whereas I expected it to decrease. MapReduce is known for
>> its computational performance, but I do not see this in these small tests.
>>
>> What do you think about this issue?
>>
>>

Re: Increasing map reduce tasks will increase the CPU time - does this seem correct?

Posted by Nitin Pawar <ni...@gmail.com>.
If the number of maps or reducers your job launches is more than the
job queue/cluster capacity, CPU time will increase.
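
As a quick sanity check of that condition, you can ask the cluster how
many task slots it actually has before sizing the job. A sketch assuming
the Hadoop 1.x mapred client API:

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CapacityCheck {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        ClusterStatus status = client.getClusterStatus();
        // Total configured task slots across all TaskTrackers. If a job
        // launches more tasks than this, the extras wait in the queue:
        // you pay scheduling overhead without gaining parallelism.
        System.out.println("map slots:    " + status.getMaxMapTasks());
        System.out.println("reduce slots: " + status.getMaxReduceTasks());
    }
}
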
On Dec 13, 2012 4:02 PM, "imen Megdiche" <im...@gmail.com> wrote:

> Hello,
>
> I am trying to increase the number of map and reduce tasks for a job,
> and even for the same data size, I noticed that the total CPU time
> increases, whereas I expected it to decrease. MapReduce is known for
> its computational performance, but I do not see this in these small tests.
>
> What do you think about this issue?
>
>