Posted to common-user@hadoop.apache.org by praveenesh kumar <pr...@gmail.com> on 2011/04/19 06:51:43 UTC

Hadoop Speed Efficiency ??

Hello everyone,

I am new to Hadoop.
I set up a Hadoop cluster of 4 Ubuntu systems (Hadoop 0.20.2), and I am
running the well-known word count (Gutenberg) example to test how fast
my Hadoop setup works.

But whenever I run the wordcount example, I don't see much difference in
processing time: on a single node the wordcount takes about the same
time as it does on the cluster of 4 systems.

Am I doing anything wrong here? Can anyone explain why this is
happening, and how I can make maximum use of my cluster?

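For reference, I am launching the job with the stock examples jar,
roughly like this (the input and output directory names are from my
setup, so treat them as an example):

  hadoop jar hadoop-0.20.2-examples.jar wordcount gutenberg gutenberg-output
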
Thanks.
Praveenesh

Re: Hadoop Speed Efficiency ??

Posted by praveenesh kumar <pr...@gmail.com>.
So you mean that if we want to taste the sweetness of Hadoop, we need
huge datasets. Well, I don't have any dataset of that size at the
moment. Can anyone suggest a website, or share a dataset of that size,
so that I can see the efficiency of Hadoop?

Thanks,
Praveenesh


Re: Hadoop Speed Efficiency ??

Posted by Mathias Herberts <ma...@gmail.com>.
Hi Praveenesh,

On Tue, Apr 19, 2011 at 07:06, praveenesh kumar <pr...@gmail.com> wrote:
> The input was 3 plain text files:
> 1 file was around 665 KB and the other 2 were around 1.5 MB each.

Files that small are not Hadoop's sweet spot: with the default block
size (64 MB), none of those files will be split, meaning each will be
processed by a single mapper, so adding machines won't improve
performance.

Try files that span several blocks; you should see a difference in
performance when adding machines.

Also, the overhead of starting mappers and so on won't be amortized
unless your input files are of substantial size.

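If you want to check how many blocks (and hence how many map tasks) an
input file will actually get, fsck will show you; a quick sketch, where
the HDFS path is just an example from a typical setup:

  hadoop fsck /user/hadoop/gutenberg -files -blocks
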
Mathias.

Re: Hadoop Speed Efficiency ??

Posted by "real great.." <gr...@gmail.com>.
To get such huge files, just Google "how to create large files";
using the cat command is one option.

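A minimal sketch of that approach, assuming your sample book is called
pg-gutenberg.txt and your HDFS input directory is gutenberg (both names
are examples):

  # duplicate the book until the result spans several 64 MB blocks
  for i in $(seq 1 200); do cat pg-gutenberg.txt >> big-book.txt; done

  # copy it into HDFS so wordcount gets a multi-block input
  hadoop fs -put big-book.txt gutenberg/big-book.txt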


-- 
Regards,
R.V.

Re: Hadoop Speed Efficiency ??

Posted by praveenesh kumar <pr...@gmail.com>.
Thanks a lot, everyone.
I will try all these things and hopefully get a clear picture.

Regards,
Praveenesh


Re: Hadoop Speed Efficiency ??

Posted by Mehmet Tepedelenlioglu <me...@gmail.com>.
As was suggested, create your own input and put it into HDFS. You can
create it on your local disk and copy it to HDFS with a single command.
Create a list of 1000 random "words", pick from the list randomly a few
million times, and place the result into HDFS as one or several files
of 64 MB or more. That should do it.

But note that jobs which are not CPU-intensive, and whose data fits in
RAM, will finish faster on 1 machine than on 4; the benefit starts when
you have more data than fits in RAM. MapReduce gives you a tool for
gathering values by key and processing them in batches, where each set
of values corresponding to a key can hopefully fit in RAM. Usually the
point is not to do things faster, but to do them at all.

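A sketch of that recipe in shell and awk; the file names, the 1000-word
vocabulary, and the ~90 MB target are arbitrary choices for
illustration, not anything Hadoop requires:

  # build a 1000-word vocabulary, one word per line
  awk 'BEGIN { for (i = 0; i < 1000; i++) printf "word%04d\n", i }' > vocab.txt

  # sample ~10 million words from it with replacement, 12 words per line
  awk 'BEGIN { srand() }
       { vocab[NR] = $0 }
       END { for (i = 1; i <= 10000000; i++)
               printf "%s%s", vocab[int(rand() * NR) + 1], (i % 12 ? " " : "\n") }' \
      vocab.txt > big-input.txt

  # copy the result into HDFS for the job to read
  hadoop fs -put big-input.txt input/big-input.txt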



Re: Hadoop Speed Efficiency ??

Posted by praveenesh kumar <pr...@gmail.com>.
Thank you guys for clearing my glasses; now I can see the clean picture :-)
So how can I test my cluster? Can anyone suggest a scenario, or point me
to a dataset of this size, or a website where I can get one?

Thanks,
Praveenesh


Re: Hadoop Speed Efficiency ??

Posted by Mehmet Tepedelenlioglu <me...@gmail.com>.
For such small input, the only way you would see speed gains would be
if your job was dominated by CPU time, not I/O. Since word count is
mostly an I/O problem and your input size is quite small, you are
seeing similar run times. 3 computers are better than 1 only if you
actually need them.

On Apr 18, 2011, at 10:06 PM, praveenesh kumar wrote:

> The input was 3 plain text files:
> 1 file was around 665 KB and the other 2 were around 1.5 MB each.


Re: Hadoop Speed Efficiency ??

Posted by praveenesh kumar <pr...@gmail.com>.
The input was 3 plain text files:

1 file was around 665 KB and the other 2 were around 1.5 MB each.

Thanks,
Praveenesh




Re: Hadoop Speed Efficiency ??

Posted by "real great.." <gr...@gmail.com>.
What's your input size?




-- 
Regards,
R.V.