Posted to common-user@hadoop.apache.org by Matthias Kricke <ma...@gmail.com> on 2012/08/13 13:51:55 UTC

how to enhance job start up speed?

Hello all,

I'm using CDH3u3.
If I want to process one file that is set to non-splittable, Hadoop starts one
Mapper and no Reducer (that's fine for this test scenario). The Mapper
goes through a configuration step where some variables for the worker
inside the Mapper are initialized.
The Mapper then gives me K,V pairs, which are the lines of the input file, and
I process each value with the worker.
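
(For reference, the non-splittable setup I describe boils down to something
like this minimal sketch; the class name is made up:)

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Forces a single input split, and therefore a single mapper, per file.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }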

When I compare the run time of Hadoop to the run time of the same process
run sequentially, I get:

worker time --> same in both cases

case: mapper --> overhead of ~32% relative to the worker process (same for a
bigger chunk size)
case: sequential --> overhead of ~15% relative to the worker process

It shouldn't be that much slower: since the file is non-splittable, the mapper
will be executed where the data is stored by HDFS, won't it?
Where did those 17% go? How can I reduce this? Does Hadoop need the whole time
for reading or streaming the data out of HDFS?

I would appreciate your help,

Greetings
mk

Re: how to enhance job start up speed?

Posted by Bejoy KS <be...@gmail.com>.
Hi Matthias

The best approach is to make your file splittable; that would bring in data locality as well. If you are using compression, use LZO or Snappy with SequenceFiles, which are splittable. If the files are splittable, there is no need to worry about block locations, as MR generates the tasks taking data locality into account.
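
For example, a minimal sketch of converting a text file into a block-compressed,
splittable SequenceFile with Snappy (assuming the Snappy native library is
available on your cluster; the paths and key/value types are only placeholders):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.SnappyCodec;

    // Converts a local text file into a block-compressed SequenceFile on HDFS.
    // Block compression keeps the output splittable, so MR can run several map
    // tasks against it with data locality.
    public class TextToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path(args[1]),                 // HDFS output path
                    LongWritable.class, Text.class,
                    SequenceFile.CompressionType.BLOCK, new SnappyCodec());
            BufferedReader in = new BufferedReader(new FileReader(args[0])); // local input
            String line;
            long offset = 0;
            while ((line = in.readLine()) != null) {
                writer.append(new LongWritable(offset), new Text(line));
                offset += line.length() + 1;
            }
            in.close();
            writer.close();
        }
    }

MR can then split the resulting file into several map tasks, each reading its blocks locally.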


Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Matthias Kricke <ma...@gmail.com>
Sender: matthias.zengler@gmail.com
Date: Mon, 13 Aug 2012 17:53:55 
To: <us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Re: how to enhance job start up speed?

@Bejoy KS: Thanks for your advice.

@Bertrand: It is parallelisable; this is just a test case. In later cases
there will be a lot of big files, each of which should be processed completely
in one map step. We want to minimize the overhead of network traffic. The
idea is to execute some worker (which could be different things, e.g. word
count, line count, translation, etc.) at the node where the file is located.

If I understand it correctly so far, we need to do several things: first, the
chunk size should be as big as the file, so that the file ends up on a single
node of the Hadoop cluster, am I right? And then set the file to
non-splittable.

Do you have some more advice? Anyway, thanks so far!

Greetings,
MK

2012/8/13 Bertrand Dechoux <de...@gmail.com>

> It was almost what I was getting at but I was not sure about your problem.
> Basically, Hadoop is only adding overhead due to the way your job is
> constructed.
> Now the question is : why do you need a single mapper? Is your need truly
> not 'parallelisable'?
>
> Bertrand
>
>
> On Mon, Aug 13, 2012 at 4:49 PM, Bejoy KS <be...@gmail.com> wrote:
>
>> Hi Matthias
>>
>> When a MapReduce program is used there are some extra steps, like checking
>> the input and output dirs, calculating the input splits, the JT assigning a
>> TT to execute the task, etc.
>>
>> If your file is non-splittable, then one map task per file will be generated
>> irrespective of the number of HDFS blocks. Some blocks will be on a
>> different node than the node where the map task is executed, so time will be
>> spent there on the network transfer.
>>
>> In your case MR would be an overhead: as your file is non-splittable there
>> is no parallelism, and there is also the overhead of copying blocks to the
>> map task node.
>> Regards
>> Bejoy KS
>>
>> Sent from handheld, please excuse typos.
>> ------------------------------
>> From: Matthias Kricke <ma...@gmail.com>
>> Sender: matthias.zengler@gmail.com
>> Date: Mon, 13 Aug 2012 16:33:06 +0200
>> To: <us...@hadoop.apache.org>
>> Reply-To: user@hadoop.apache.org
>> Subject: Re: how to enhance job start up speed?
>>
>> OK, I'll try to clarify:
>>
>> 1) The worker is the logic inside my mapper, and it is the same for both
>> cases.
>> 2) I have two cases. In the first one I use Hadoop to execute my worker, and
>> in the second one I execute my worker without Hadoop (a simple read of the
>> file).
>>    Now I measured, for both cases, the time the worker and the surroundings
>> need (so I have two values for each case). The worker took the same time in
>> both cases for the same input (this is expected). But the surroundings took
>> 17% more time when using Hadoop.
>> 3) ~3 GB.
>>
>> I want to know how to reduce this difference and where it comes from.
>> I hope that helped? If not, feel free to ask again :)
>>
>> Greetings,
>> MK
>>
>> P.S. just for your information, I did the same test with hypertable as
>> well.
>> I got:
>>  * worker without anything: 15% overhead
>>  * worker with hadoop: 32% overhead
>>  * worker with hypertable: 53% overhead
>> Remark: overhead was measured in comparison to the worker. e.g.
>> hypertable uses 53% of the whole process time, while worker uses 47%.
>>
>> 2012/8/13 Bertrand Dechoux <de...@gmail.com>
>>
>>> I am not sure I understand, and I guess I am not the only one.
>>>
>>> 1) What's a worker in your context? Only the logic inside your Mapper or
>>> something else?
>>> 2) You should clarify your cases. You seem to have two cases, but both are
>>> given as overheads, so I am assuming there is a baseline? Hadoop vs
>>> sequential, so sequential is not Hadoop?
>>> 3) What is the size of the file?
>>>
>>> Bertrand
>>>
>>>
>>> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
>>> matthias.mk.kricke@gmail.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I'm using CDH3u3.
>>>> If I want to process one file that is set to non-splittable, Hadoop starts
>>>> one Mapper and no Reducer (that's fine for this test scenario). The Mapper
>>>> goes through a configuration step where some variables for the worker
>>>> inside the Mapper are initialized.
>>>> The Mapper then gives me K,V pairs, which are the lines of the input file,
>>>> and I process each value with the worker.
>>>>
>>>> When I compare the run time of Hadoop to the run time of the same process
>>>> run sequentially, I get:
>>>>
>>>> worker time --> same in both cases
>>>>
>>>> case: mapper --> overhead of ~32% relative to the worker process (same for
>>>> a bigger chunk size)
>>>> case: sequential --> overhead of ~15% relative to the worker process
>>>>
>>>> It shouldn't be that much slower: since the file is non-splittable, the
>>>> mapper will be executed where the data is stored by HDFS, won't it?
>>>> Where did those 17% go? How can I reduce this? Does Hadoop need the whole
>>>> time for reading or streaming the data out of HDFS?
>>>>
>>>> I would appreciate your help,
>>>>
>>>> Greetings
>>>> mk
>>>>
>>>>
>>>
>>>
>>> --
>>> Bertrand Dechoux
>>>
>>
>>
>
>
> --
> Bertrand Dechoux
>


Re: how to enhance job start up speed?

Posted by Bertrand Dechoux <de...@gmail.com>.
It seems like you want to misuse Hadoop, but maybe I still don't understand
your context.

The standard way would be to split your files into multiple map tasks. Each map
could then profit from data locality. Do part of the worker's work in the
mapper and then use a reducer to aggregate all the results (which could be
another part of your worker). That way you would be able to parallelise your
worker logic over a file. You seem to avoid using a reducer in order to lessen
the network traffic. That's a valid concern, but reducers do have their uses
too.
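
As a rough sketch of that shape, a hypothetical line-count worker split into a
map and a reduce step could look like the following (class names are made up;
the same reducer could also be reused as a combiner to cut down the network
traffic you are worried about):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class LineCount {

        // Map side: the per-line part of the worker runs close to the data and
        // emits only a tiny partial result per line.
        public static class Map
                extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(NullWritable.get(), ONE);
            }
        }

        // Reduce side (also usable as a combiner): aggregates the partial
        // results coming from all the mappers.
        public static class Reduce
                extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
            @Override
            protected void reduce(NullWritable key, Iterable<LongWritable> values,
                    Context context) throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable v : values) {
                    total += v.get();
                }
                context.write(key, new LongWritable(total));
            }
        }
    }

Whether the aggregation is cheap enough to run as a combiner depends on your
actual worker, of course.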

Bertrand


On Mon, Aug 13, 2012 at 5:53 PM, Matthias Kricke <
matthias.mk.kricke@gmail.com> wrote:

> @Bejoy KS: Thanks for your advice.
>
> @Bertrand: It is parallelisable; this is just a test case. In later cases
> there will be a lot of big files, each of which should be processed completely
> in one map step. We want to minimize the overhead of network traffic. The
> idea is to execute some worker (which could be different things, e.g. word
> count, line count, translation, etc.) at the node where the file is located.
>
> If I understand it correctly so far, we need to do several things: first, the
> chunk size should be as big as the file, so that the file ends up on a single
> node of the Hadoop cluster, am I right? And then set the file to
> non-splittable.
>
> Do you have some more advice? Anyway, thanks so far!
>
> Greetings,
> MK
>
>
> 2012/8/13 Bertrand Dechoux <de...@gmail.com>
>
>> It was almost what I was getting at but I was not sure about your
>> problem.
>> Basically, Hadoop is only adding overhead due to the way your job is
>> constructed.
>> Now the question is : why do you need a single mapper? Is your need truly
>> not 'parallelisable'?
>>
>> Bertrand
>>
>>
>> On Mon, Aug 13, 2012 at 4:49 PM, Bejoy KS <be...@gmail.com> wrote:
>>
>>> Hi Matthias
>>>
>>> When a MapReduce program is used there are some extra steps, like checking
>>> the input and output dirs, calculating the input splits, the JT assigning a
>>> TT to execute the task, etc.
>>>
>>> If your file is non-splittable, then one map task per file will be generated
>>> irrespective of the number of HDFS blocks. Some blocks will be on a
>>> different node than the node where the map task is executed, so time will be
>>> spent there on the network transfer.
>>>
>>> In your case MR would be an overhead: as your file is non-splittable there
>>> is no parallelism, and there is also the overhead of copying blocks to the
>>> map task node.
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from handheld, please excuse typos.
>>> ------------------------------
>>> From: Matthias Kricke <ma...@gmail.com>
>>> Sender: matthias.zengler@gmail.com
>>> Date: Mon, 13 Aug 2012 16:33:06 +0200
>>> To: <us...@hadoop.apache.org>
>>> Reply-To: user@hadoop.apache.org
>>> Subject: Re: how to enhance job start up speed?
>>>
>>> OK, I'll try to clarify:
>>>
>>> 1) The worker is the logic inside my mapper, and it is the same for both
>>> cases.
>>> 2) I have two cases. In the first one I use Hadoop to execute my worker, and
>>> in the second one I execute my worker without Hadoop (a simple read of the
>>> file).
>>>    Now I measured, for both cases, the time the worker and the surroundings
>>> need (so I have two values for each case). The worker took the same time in
>>> both cases for the same input (this is expected). But the surroundings took
>>> 17% more time when using Hadoop.
>>> 3) ~3 GB.
>>>
>>> I want to know how to reduce this difference and where it comes from.
>>> I hope that helped? If not, feel free to ask again :)
>>>
>>> Greetings,
>>> MK
>>>
>>> P.S. just for your information, I did the same test with hypertable as
>>> well.
>>> I got:
>>>  * worker without anything: 15% overhead
>>>  * worker with hadoop: 32% overhead
>>>  * worker with hypertable: 53% overhead
>>> Remark: overhead was measured in comparison to the worker. e.g.
>>> hypertable uses 53% of the whole process time, while worker uses 47%.
>>>
>>> 2012/8/13 Bertrand Dechoux <de...@gmail.com>
>>>
>>>> I am not sure I understand, and I guess I am not the only one.
>>>>
>>>> 1) What's a worker in your context? Only the logic inside your Mapper or
>>>> something else?
>>>> 2) You should clarify your cases. You seem to have two cases, but both are
>>>> given as overheads, so I am assuming there is a baseline? Hadoop vs
>>>> sequential, so sequential is not Hadoop?
>>>> 3) What is the size of the file?
>>>>
>>>> Bertrand
>>>>
>>>>
>>>> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
>>>> matthias.mk.kricke@gmail.com> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I'm using CDH3u3.
>>>>> If I want to process one file that is set to non-splittable, Hadoop starts
>>>>> one Mapper and no Reducer (that's fine for this test scenario). The Mapper
>>>>> goes through a configuration step where some variables for the worker
>>>>> inside the Mapper are initialized.
>>>>> The Mapper then gives me K,V pairs, which are the lines of the input file,
>>>>> and I process each value with the worker.
>>>>>
>>>>> When I compare the run time of Hadoop to the run time of the same process
>>>>> run sequentially, I get:
>>>>>
>>>>> worker time --> same in both cases
>>>>>
>>>>> case: mapper --> overhead of ~32% relative to the worker process (same for
>>>>> a bigger chunk size)
>>>>> case: sequential --> overhead of ~15% relative to the worker process
>>>>>
>>>>> It shouldn't be that much slower: since the file is non-splittable, the
>>>>> mapper will be executed where the data is stored by HDFS, won't it?
>>>>> Where did those 17% go? How can I reduce this? Does Hadoop need the whole
>>>>> time for reading or streaming the data out of HDFS?
>>>>>
>>>>> I would appreciate your help,
>>>>>
>>>>> Greetings
>>>>> mk
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Bertrand Dechoux
>>>>
>>>
>>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>


-- 
Bertrand Dechoux


Re: how to enhance job start up speed?

Posted by Matthias Kricke <ma...@gmail.com>.
@Bejoy KS: Thanks for your advice.

@Bertrand: It is parallelisable; this is just a test case. In later cases
there will be a lot of big files, each of which should be processed completely
in one map step. We want to minimize the overhead of network traffic. The
idea is to execute some worker (which could be different things, e.g. word
count, line count, translation, etc.) at the node where the file is located.

If I understand it correctly so far, we need to do several things: first, the
chunk size should be as big as the file, so that the file ends up on a single
node of the Hadoop cluster, am I right? And then set the file to
non-splittable.
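
(Something like the following minimal sketch is what I have in mind for the
upload step; the paths are placeholders and the 4 GB block size is only chosen
to be larger than the ~3 GB input:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Copies the input into HDFS with a block size larger than the file itself,
    // so the whole file ends up in a single block on a single set of datanodes.
    public class PutWithBigBlock {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setLong("dfs.block.size", 4L * 1024 * 1024 * 1024); // 4 GB
            FileSystem fs = FileSystem.get(conf);
            fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
        }
    }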

Do you have some more advice? Anyway, thanks so far!

Greetings,
MK

2012/8/13 Bertrand Dechoux <de...@gmail.com>

> It was almost what I was getting at but I was not sure about your problem.
> Basically, Hadoop is only adding overhead due to the way your job is
> constructed.
> Now the question is : why do you need a single mapper? Is your need truly
> not 'parallelisable'?
>
> Bertrand
>
>
> On Mon, Aug 13, 2012 at 4:49 PM, Bejoy KS <be...@gmail.com> wrote:
>
>> Hi Matthias
>>
>> When a MapReduce program is used there are some extra steps, like checking
>> the input and output dirs, calculating the input splits, the JT assigning a
>> TT to execute the task, etc.
>>
>> If your file is non-splittable, then one map task per file will be generated
>> irrespective of the number of HDFS blocks. Some blocks will be on a
>> different node than the node where the map task is executed, so time will be
>> spent there on the network transfer.
>>
>> In your case MR would be an overhead: as your file is non-splittable there
>> is no parallelism, and there is also the overhead of copying blocks to the
>> map task node.
>> Regards
>> Bejoy KS
>>
>> Sent from handheld, please excuse typos.
>> ------------------------------
>> From: Matthias Kricke <ma...@gmail.com>
>> Sender: matthias.zengler@gmail.com
>> Date: Mon, 13 Aug 2012 16:33:06 +0200
>> To: <us...@hadoop.apache.org>
>> Reply-To: user@hadoop.apache.org
>> Subject: Re: how to enhance job start up speed?
>>
>> OK, I'll try to clarify:
>>
>> 1) The worker is the logic inside my mapper, and it is the same for both
>> cases.
>> 2) I have two cases. In the first one I use Hadoop to execute my worker, and
>> in the second one I execute my worker without Hadoop (a simple read of the
>> file).
>>    Now I measured, for both cases, the time the worker and the surroundings
>> need (so I have two values for each case). The worker took the same time in
>> both cases for the same input (this is expected). But the surroundings took
>> 17% more time when using Hadoop.
>> 3) ~3 GB.
>>
>> I want to know how to reduce this difference and where it comes from.
>> I hope that helped? If not, feel free to ask again :)
>>
>> Greetings,
>> MK
>>
>> P.S. just for your information, I did the same test with hypertable as
>> well.
>> I got:
>>  * worker without anything: 15% overhead
>>  * worker with hadoop: 32% overhead
>>  * worker with hypertable: 53% overhead
>> Remark: overhead was measured in comparison to the worker. e.g.
>> hypertable uses 53% of the whole process time, while worker uses 47%.
>>
>> 2012/8/13 Bertrand Dechoux <de...@gmail.com>
>>
>>> I am not sure I understand, and I guess I am not the only one.
>>>
>>> 1) What's a worker in your context? Only the logic inside your Mapper or
>>> something else?
>>> 2) You should clarify your cases. You seem to have two cases, but both are
>>> given as overheads, so I am assuming there is a baseline? Hadoop vs
>>> sequential, so sequential is not Hadoop?
>>> 3) What is the size of the file?
>>>
>>> Bertrand
>>>
>>>
>>> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
>>> matthias.mk.kricke@gmail.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I'm using CDH3u3.
>>>> If I want to process one file that is set to non-splittable, Hadoop starts
>>>> one Mapper and no Reducer (that's fine for this test scenario). The Mapper
>>>> goes through a configuration step where some variables for the worker
>>>> inside the Mapper are initialized.
>>>> The Mapper then gives me K,V pairs, which are the lines of the input file,
>>>> and I process each value with the worker.
>>>>
>>>> When I compare the run time of Hadoop to the run time of the same process
>>>> run sequentially, I get:
>>>>
>>>> worker time --> same in both cases
>>>>
>>>> case: mapper --> overhead of ~32% relative to the worker process (same for
>>>> a bigger chunk size)
>>>> case: sequential --> overhead of ~15% relative to the worker process
>>>>
>>>> It shouldn't be that much slower: since the file is non-splittable, the
>>>> mapper will be executed where the data is stored by HDFS, won't it?
>>>> Where did those 17% go? How can I reduce this? Does Hadoop need the whole
>>>> time for reading or streaming the data out of HDFS?
>>>>
>>>> I would appreciate your help,
>>>>
>>>> Greetings
>>>> mk
>>>>
>>>>
>>>
>>>
>>> --
>>> Bertrand Dechoux
>>>
>>
>>
>
>
> --
> Bertrand Dechoux
>

Re: how to enhance job start up speed?

Posted by Matthias Kricke <ma...@gmail.com>.
@Bejoy KS: Thanks for your advice.

@Bertrand: It is parallelisable, this is just a test case. In later cases
there will be a lot of big files which should be processed completly each
in one map step. We want to minimize the overhead of network traffic. The
idea is to execute some worker (could be different stuff, e.g. wordcount,
linecount, translation etc) at the node where the file is situated.

If I get it right so far, we need to do several things... first chunk size
should be as big as the file. Then the file is on a single node of the
hadoop cluster, am I right? And
set the file to non splitable.

Did you have some more advice? Anyway thanks so far!

Greetings,
MK

2012/8/13 Bertrand Dechoux <de...@gmail.com>

> It was almost what I was getting at but I was not sure about your problem.
> Basically, Hadoop is only adding overhead due to the way your job is
> constructed.
> Now the question is : why do you need a single mapper? Is your need truly
> not 'parallelisable'?
>
> Bertrand
>
>
> On Mon, Aug 13, 2012 at 4:49 PM, Bejoy KS <be...@gmail.com> wrote:
>
>> **
>> Hi Matthais
>>
>> When an mapreduce program is being used there are some extra steps like
>> checking for input and output dir, calclulating input splits, JT assigning
>> TT for executing the task etc.
>>
>> If your file is non splittable , then one map task per file will be
>> generated irrespective of the number of hdfs blocks. Now some blocks will
>> be in a different node than the node where map task is executed so time
>> will be spend here on the network transfer.
>>
>> In your case MR would be a overhead as your file is non splittable hence
>> no parallelism and also there is an overhead of copying blocks to the map
>> task node.
>> Regards
>> Bejoy KS
>>
>> Sent from handheld, please excuse typos.
>> ------------------------------
>> *From: * Matthias Kricke <ma...@gmail.com>
>> *Sender: * matthias.zengler@gmail.com
>> *Date: *Mon, 13 Aug 2012 16:33:06 +0200
>> *To: *<us...@hadoop.apache.org>
>> *ReplyTo: * user@hadoop.apache.org
>> *Subject: *Re: how to enhance job start up speed?
>>
>> Ok, I try to clarify:
>>
>> 1) The worker is the logic inside my mapper and the same for both cases.
>> 2) I have two cases. In the first one I use hadoop to execute my worker
>> and in a second one, I execute my worker without hadoop (simple read of the
>> file).
>>    Now I measured, for both cases, the time the worker and
>> the surroundings need (so i have two values for each case). The worker took
>> the same time in both cases for the same input (this is expected). But the
>> surroundings took 17%  more time when using hadoop.
>> 3) ~  3GB.
>>
>> I want to know how to reduce this difference and where they come from.
>> I hope that helped? If not, feel free to ask again :)
>>
>> Greetings,
>> MK
>>
>> P.S. just for your information, I did the same test with hypertable as
>> well.
>> I got:
>>  * worker without anything: 15% overhead
>>  * worker with hadoop: 32% overhead
>>  * worker with hypertable: 53% overhead
>> Remark: overhead was measured in comparison to the worker. e.g.
>> hypertable uses 53% of the whole process time, while worker uses 47%.
>>
>> 2012/8/13 Bertrand Dechoux <de...@gmail.com>
>>
>>> I am not sure to understand and I guess I am not the only one.
>>>
>>> 1) What's a worker in your context? Only the logic inside your Mapper or
>>> something else?
>>> 2) You should clarify your cases. You seem to have two cases but both
>>> are in overhead so I am assuming there is a baseline? Hadoop vs sequential,
>>> so sequential is not Hadoop?
>>> 3) What are the size of the file?
>>>
>>> Bertrand
>>>
>>>
>>> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
>>> matthias.mk.kricke@gmail.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I'm using CDH3u3.
>>>> If I want to process one File, set to non splitable hadoop starts one
>>>> Mapper and no Reducer (thats ok for this test scenario). The Mapper
>>>> goes through a configuration step where some variables for the worker
>>>> inside the mapper are initialized.
>>>> Now the Mapper gives me K,V-pairs, which are lines of an input file. I
>>>> process the V with the worker.
>>>>
>>>> When I compare the run time of hadoop to the run time of the same
>>>> process in sequentiell manner, I get:
>>>>
>>>> worker time --> same in both cases
>>>>
>>>> case: mapper --> overhead of ~32% to the worker process (same for
>>>> bigger chunk size)
>>>> case: sequentiell --> overhead of ~15% to the worker process
>>>>
>>>> It shouldn't be that much slower, because of non splitable, the mapper
>>>> will be executed where the data is saved by HDFS, won't it?
>>>> Where did those 17% go? How to reduce this? Did hadoop needs the whole
>>>> time for reading or streaming the data out of HDFS?
>>>>
>>>> I would appreciate your help,
>>>>
>>>> Greetings
>>>> mk
>>>>
>>>>
>>>
>>>
>>> --
>>> Bertrand Dechoux
>>>
>>
>>
>
>
> --
> Bertrand Dechoux
>

Re: how to enhance job start up speed?

Posted by Bertrand Dechoux <de...@gmail.com>.
It was almost what I was getting at, but I was not sure about your problem.
Basically, Hadoop is only adding overhead due to the way your job is
constructed.
Now the question is: why do you need a single mapper? Is your need truly
not 'parallelisable'?

Bertrand

On Mon, Aug 13, 2012 at 4:49 PM, Bejoy KS <be...@gmail.com> wrote:

> **
> Hi Matthais
>
> When an mapreduce program is being used there are some extra steps like
> checking for input and output dir, calclulating input splits, JT assigning
> TT for executing the task etc.
>
> If your file is non splittable , then one map task per file will be
> generated irrespective of the number of hdfs blocks. Now some blocks will
> be in a different node than the node where map task is executed so time
> will be spend here on the network transfer.
>
> In your case MR would be a overhead as your file is non splittable hence
> no parallelism and also there is an overhead of copying blocks to the map
> task node.
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * Matthias Kricke <ma...@gmail.com>
> *Sender: * matthias.zengler@gmail.com
> *Date: *Mon, 13 Aug 2012 16:33:06 +0200
> *To: *<us...@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *Re: how to enhance job start up speed?
>
> Ok, I try to clarify:
>
> 1) The worker is the logic inside my mapper and the same for both cases.
> 2) I have two cases. In the first one I use hadoop to execute my worker
> and in a second one, I execute my worker without hadoop (simple read of the
> file).
>    Now I measured, for both cases, the time the worker and
> the surroundings need (so i have two values for each case). The worker took
> the same time in both cases for the same input (this is expected). But the
> surroundings took 17%  more time when using hadoop.
> 3) ~  3GB.
>
> I want to know how to reduce this difference and where they come from.
> I hope that helped? If not, feel free to ask again :)
>
> Greetings,
> MK
>
> P.S. just for your information, I did the same test with hypertable as
> well.
> I got:
>  * worker without anything: 15% overhead
>  * worker with hadoop: 32% overhead
>  * worker with hypertable: 53% overhead
> Remark: overhead was measured in comparison to the worker. e.g. hypertable
> uses 53% of the whole process time, while worker uses 47%.
>
> 2012/8/13 Bertrand Dechoux <de...@gmail.com>
>
>> I am not sure to understand and I guess I am not the only one.
>>
>> 1) What's a worker in your context? Only the logic inside your Mapper or
>> something else?
>> 2) You should clarify your cases. You seem to have two cases but both are
>> in overhead so I am assuming there is a baseline? Hadoop vs sequential, so
>> sequential is not Hadoop?
>> 3) What are the size of the file?
>>
>> Bertrand
>>
>>
>> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
>> matthias.mk.kricke@gmail.com> wrote:
>>
>>> Hello all,
>>>
>>> I'm using CDH3u3.
>>> If I want to process one File, set to non splitable hadoop starts one
>>> Mapper and no Reducer (thats ok for this test scenario). The Mapper
>>> goes through a configuration step where some variables for the worker
>>> inside the mapper are initialized.
>>> Now the Mapper gives me K,V-pairs, which are lines of an input file. I
>>> process the V with the worker.
>>>
>>> When I compare the run time of hadoop to the run time of the same
>>> process in sequentiell manner, I get:
>>>
>>> worker time --> same in both cases
>>>
>>> case: mapper --> overhead of ~32% to the worker process (same for bigger
>>> chunk size)
>>> case: sequentiell --> overhead of ~15% to the worker process
>>>
>>> It shouldn't be that much slower, because of non splitable, the mapper
>>> will be executed where the data is saved by HDFS, won't it?
>>> Where did those 17% go? How to reduce this? Did hadoop needs the whole
>>> time for reading or streaming the data out of HDFS?
>>>
>>> I would appreciate your help,
>>>
>>> Greetings
>>> mk
>>>
>>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>


-- 
Bertrand Dechoux

Re: how to enhance job start up speed?

Posted by Bejoy KS <be...@gmail.com>.
Hi Matthias

When a MapReduce program is used there are some extra steps, like checking the input and output dirs, calculating the input splits, the JT assigning a TT to execute the task, etc.

If your file is non-splittable, then one map task per file will be generated irrespective of the number of HDFS blocks. Some of those blocks will be on a different node than the one where the map task runs, so time will be spent on the network transfer.

In your case MR adds overhead: since your file is non-splittable there is no parallelism, and there is also the overhead of copying blocks to the map task node.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

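A minimal sketch (not code from this thread; the class name is an invented example) of the non-splittable setup being discussed: an input format whose isSplitable() returns false makes the whole file one split, hence one map task, regardless of how many HDFS blocks the file spans.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// One split per file means one map task per file; any blocks that are not
// local to the node running that task are then read over the network, which
// is the transfer cost described above.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

// In the job driver (illustrative):
//   job.setInputFormatClass(NonSplittableTextInputFormat.class);
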
-----Original Message-----
From: Matthias Kricke <ma...@gmail.com>
Sender: matthias.zengler@gmail.com
Date: Mon, 13 Aug 2012 16:33:06 
To: <us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Re: how to enhance job start up speed?

Ok, I try to clarify:

1) The worker is the logic inside my mapper and the same for both cases.
2) I have two cases. In the first one I use hadoop to execute my worker and
in a second one, I execute my worker without hadoop (simple read of the
file).
   Now I measured, for both cases, the time the worker and
the surroundings need (so i have two values for each case). The worker took
the same time in both cases for the same input (this is expected). But the
surroundings took 17%  more time when using hadoop.
3) ~  3GB.

I want to know how to reduce this difference and where they come from.
I hope that helped? If not, feel free to ask again :)

Greetings,
MK

P.S. just for your information, I did the same test with hypertable as
well.
I got:
 * worker without anything: 15% overhead
 * worker with hadoop: 32% overhead
 * worker with hypertable: 53% overhead
Remark: overhead was measured in comparison to the worker. e.g. hypertable
uses 53% of the whole process time, while worker uses 47%.

2012/8/13 Bertrand Dechoux <de...@gmail.com>

> I am not sure to understand and I guess I am not the only one.
>
> 1) What's a worker in your context? Only the logic inside your Mapper or
> something else?
> 2) You should clarify your cases. You seem to have two cases but both are
> in overhead so I am assuming there is a baseline? Hadoop vs sequential, so
> sequential is not Hadoop?
> 3) What are the size of the file?
>
> Bertrand
>
>
> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
> matthias.mk.kricke@gmail.com> wrote:
>
>> Hello all,
>>
>> I'm using CDH3u3.
>> If I want to process one File, set to non splitable hadoop starts one
>> Mapper and no Reducer (thats ok for this test scenario). The Mapper
>> goes through a configuration step where some variables for the worker
>> inside the mapper are initialized.
>> Now the Mapper gives me K,V-pairs, which are lines of an input file. I
>> process the V with the worker.
>>
>> When I compare the run time of hadoop to the run time of the same process
>> in sequentiell manner, I get:
>>
>> worker time --> same in both cases
>>
>> case: mapper --> overhead of ~32% to the worker process (same for bigger
>> chunk size)
>> case: sequentiell --> overhead of ~15% to the worker process
>>
>> It shouldn't be that much slower, because of non splitable, the mapper
>> will be executed where the data is saved by HDFS, won't it?
>> Where did those 17% go? How to reduce this? Did hadoop needs the whole
>> time for reading or streaming the data out of HDFS?
>>
>> I would appreciate your help,
>>
>> Greetings
>> mk
>>
>>
>
>
> --
> Bertrand Dechoux
>


Re: how to enhance job start up speed?

Posted by Matthias Kricke <ma...@gmail.com>.
Ok, I'll try to clarify:

1) The worker is the logic inside my mapper and is the same for both cases.
2) I have two cases. In the first one I use Hadoop to execute my worker and
in the second one I execute my worker without Hadoop (a simple read of the
file).
   Now I measured, for both cases, the time the worker and
the surroundings need (so I have two values for each case). The worker took
the same time in both cases for the same input (this is expected). But the
surroundings took 17% more time when using Hadoop.
3) ~3 GB.

I want to know how to reduce this difference and where it comes from.
I hope that helped? If not, feel free to ask again :)

Greetings,
MK

P.S. Just for your information, I did the same test with Hypertable as
well.
I got:
 * worker without anything: 15% overhead
 * worker with Hadoop: 32% overhead
 * worker with Hypertable: 53% overhead
Remark: overhead is measured as a share of the whole process time, e.g. the
Hypertable overhead is 53% of the whole process time, while the worker uses
the remaining 47%.

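A tiny, self-contained sketch (invented names, not the poster's measurement code) of how overhead figures like the ones above can be computed: time the worker separately from the whole run and report the difference as a share of total time.

import java.util.Arrays;
import java.util.List;

public class OverheadDemo {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("line 1", "line 2", "line 3"); // stand-in input
        long totalStart = System.nanoTime();
        long workerNanos = 0L;
        for (String line : lines) {
            long t0 = System.nanoTime();
            process(line);                         // the "worker" logic
            workerNanos += System.nanoTime() - t0;
        }
        long totalNanos = System.nanoTime() - totalStart;
        double overheadPct = 100.0 * (totalNanos - workerNanos) / totalNanos;
        System.out.printf("overhead: %.1f%% of total time, worker: %.1f%%%n",
                overheadPct, 100.0 - overheadPct);
    }

    // Stand-in for the real per-line work.
    private static void process(String line) {
        line.toUpperCase();
    }
}
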
2012/8/13 Bertrand Dechoux <de...@gmail.com>

> I am not sure to understand and I guess I am not the only one.
>
> 1) What's a worker in your context? Only the logic inside your Mapper or
> something else?
> 2) You should clarify your cases. You seem to have two cases but both are
> in overhead so I am assuming there is a baseline? Hadoop vs sequential, so
> sequential is not Hadoop?
> 3) What are the size of the file?
>
> Bertrand
>
>
> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
> matthias.mk.kricke@gmail.com> wrote:
>
>> Hello all,
>>
>> I'm using CDH3u3.
>> If I want to process one File, set to non splitable hadoop starts one
>> Mapper and no Reducer (thats ok for this test scenario). The Mapper
>> goes through a configuration step where some variables for the worker
>> inside the mapper are initialized.
>> Now the Mapper gives me K,V-pairs, which are lines of an input file. I
>> process the V with the worker.
>>
>> When I compare the run time of hadoop to the run time of the same process
>> in sequentiell manner, I get:
>>
>> worker time --> same in both cases
>>
>> case: mapper --> overhead of ~32% to the worker process (same for bigger
>> chunk size)
>> case: sequentiell --> overhead of ~15% to the worker process
>>
>> It shouldn't be that much slower, because of non splitable, the mapper
>> will be executed where the data is saved by HDFS, won't it?
>> Where did those 17% go? How to reduce this? Did hadoop needs the whole
>> time for reading or streaming the data out of HDFS?
>>
>> I would appreciate your help,
>>
>> Greetings
>> mk
>>
>>
>
>
> --
> Bertrand Dechoux
>

Re: how to enhance job start up speed?

Posted by Bertrand Dechoux <de...@gmail.com>.
I am not sure I understand, and I guess I am not the only one.

1) What's a worker in your context? Only the logic inside your Mapper, or
something else?
2) You should clarify your cases. You seem to have two cases, but both are
given as overhead, so I am assuming there is a baseline? Hadoop vs. sequential,
so sequential is not Hadoop?
3) What is the size of the file?

Bertrand

On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
matthias.mk.kricke@gmail.com> wrote:

> Hello all,
>
> I'm using CDH3u3.
> If I want to process one File, set to non splitable hadoop starts one
> Mapper and no Reducer (thats ok for this test scenario). The Mapper
> goes through a configuration step where some variables for the worker
> inside the mapper are initialized.
> Now the Mapper gives me K,V-pairs, which are lines of an input file. I
> process the V with the worker.
>
> When I compare the run time of hadoop to the run time of the same process
> in sequentiell manner, I get:
>
> worker time --> same in both cases
>
> case: mapper --> overhead of ~32% to the worker process (same for bigger
> chunk size)
> case: sequentiell --> overhead of ~15% to the worker process
>
> It shouldn't be that much slower, because of non splitable, the mapper
> will be executed where the data is saved by HDFS, won't it?
> Where did those 17% go? How to reduce this? Did hadoop needs the whole
> time for reading or streaming the data out of HDFS?
>
> I would appreciate your help,
>
> Greetings
> mk
>
>


-- 
Bertrand Dechoux
