Posted to user@hadoop.apache.org by "Kartashov, Andy" <An...@mpac.ca> on 2012/11/08 15:35:50 UTC

Hadoop processing

Hadoopers,
“Hadoop ships the code to the data instead of sending the data to the code.”
Say you added two DNs/TTs to the cluster. They have no data at this point, i.e. you have not run the balancer.
Given the statement quoted above, will these two nodes not participate in a MapReduce job until you have balanced some data onto them? Please elaborate.

Rgds,
AK47

Re: Hadoop processing

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Andy,

     Just to add to what Mr. Jay has said, the MR framework does its best to
run the map task on a node where the input data is present. Sometimes,
however, all the nodes (based on the replication factor) hosting the data
block for a map task's input split have no free slots. In that case, the
job scheduler will look for a free map slot on a node in the same rack as
one of the blocks. Very occasionally even this is not possible, and an
off-rack node is used.
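
As a small illustration of the candidate set the scheduler works from, the
sketch below (the input path is hypothetical) asks HDFS which DataNodes hold
the replicas of a file's blocks:

    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockHosts {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Hypothetical input file; substitute a real HDFS path.
            FileStatus stat = fs.getFileStatus(new Path("/user/andy/input/part-00000"));
            // One BlockLocation per block; getHosts() names the DNs holding a replica.
            for (BlockLocation block : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
                System.out.println(Arrays.toString(block.getHosts()));
            }
        }
    }

A freshly added DN will not appear in any of these lists until it receives
replicas, which is why the scheduler has no node-local work for it at first.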

Regards,
    Mohammad Tariq



On Thu, Nov 8, 2012 at 8:19 PM, Jay Vyas <ja...@gmail.com> wrote:

> Hmm this is interesting.  I think that:
>
> 1) For the map phase, Hadoop is smart enough to try to run mappers
> locally, but I think you could force these DNs to participate actively in
> a map job by decreasing the size of input splits. That would allow for
> many more mappers, some of which would be forced to run on blocks that are
> not local; in this scenario, those DNs don't yet have any local files that
> would be used for the input.
>
> 2) For the reduce phase, since the reducers will of course be copying
> mapper outputs from all over the cluster, one would expect that your
> DataNodes would naturally take part in this portion of the job if the
> number of reducers (mapred.reduce.tasks) was specified.
>
>
> On Thu, Nov 8, 2012 at 9:35 AM, Kartashov, Andy <An...@mpac.ca> wrote:
>
>>  Hadoopers,
>>
>> “Hadoop ships the code to the data instead of sending the data to the
>> code.”
>>
>> Say you added two DNs/TTs to the cluster. They have no data at this
>> point, i.e. you have not run the balancer.
>>
>> Given the statement quoted above, will these two nodes not participate
>> in a MapReduce job until you have balanced some data onto them? Please
>> elaborate.
>>
>>
>>
>> Rgds,
>>
>> AK47
>>
>
>
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com
>

RE: Hadoop processing

Posted by "Kartashov, Andy" <An...@mpac.ca>.
Thanks, guys, for your responses. This is exactly what my gut was telling me.
I suspected as much: "So in that case, yes, the data is shipped to the node."

Following a suggestion here, I went and checked out some Hadoop test examples
and came across the question below. I thought that C (the correct answer)
wasn't entirely correct, so I went with A. :(
How does Hadoop process large volumes of data?

A.  Hadoop uses a lot of machines in parallel. This optimizes data processing.

B.  Hadoop was specifically designed to process large amounts of data by taking advantage of MPP hardware.

C.  Hadoop ships the code to the data instead of sending the data to the code.

D.  Hadoop uses sophisticated caching techniques on the NameNode to speed processing of data.

Rgds,
AK47

From: Michael Segel [mailto:michael_segel@hotmail.com]
Sent: Thursday, November 08, 2012 10:03 AM
To: user@hadoop.apache.org
Subject: Re: Hadoop processing

To go back to the OP's initial position:
two new nodes where the data hasn't yet been 'balanced'.

First, that's a small window of time.

But to answer your question...

The JT will attempt to schedule work where the data is. If you're using 3x replication, there are three nodes where each block resides, so you can calculate the odds of getting an open slot to process your data local to its location.

However, if there is an open slot where the data is not located, you will still process the data in that open slot. You lose data locality, and that smaller chunk of data will be processed on that node. So in that case, yes, the data is shipped to the node. If you look at your JobTracker web page for the results of your processing, you will see what percentage of the work was data-local. Hadoop is pretty good in that respect.

NOTE THE FOLLOWING...
If you know that the processing time is a couple of orders of magnitude longer than the time it takes to ship the data to a node, you can override the normal behavior and force the processing to be done remotely. (We've done this, and there is a paper on it on InfoQ.) [We were bored and didn't like the fact that our Ganglia maps were not all red. We are evil in that way ;-) ] We really don't recommend doing this unless you are either insane or really know what you are doing.

HTH

-Mike

On Nov 8, 2012, at 8:49 AM, Jay Vyas <ja...@gmail.com> wrote:


Hmm this is interesting.  I think that:

1) For the map phase, Hadoop is smart enough to try to run mappers locally, but I think you could force these DNs to participate actively in a map job by decreasing the size of input splits. That would allow for many more mappers, some of which would be forced to run on blocks that are not local; in this scenario, those DNs don't yet have any local files that would be used for the input.

2) For the reduce phase, since the reducers will of course be copying mapper outputs from all over the cluster, one would expect that your DataNodes would naturally take part in this portion of the job if the number of reducers (mapred.reduce.tasks) was specified.

On Thu, Nov 8, 2012 at 9:35 AM, Kartashov, Andy <An...@mpac.ca> wrote:
Hadoopers,
"Hadoop ships the code to the data instead of sending the data to the code."
Say you added two DNs/TTs to the cluster. They have no data at this point, i.e. you have not run the balancer.
Given the statement quoted above, will these two nodes not participate in a MapReduce job until you have balanced some data onto them? Please elaborate.

Rgds,
AK47



--
Jay Vyas
http://jayunit100.blogspot.com

Re: Hadoop processing

Posted by Michael Segel <mi...@hotmail.com>.
To go back to the OP's initial position:
two new nodes where the data hasn't yet been 'balanced'.

First, that's a small window of time.

But to answer your question...

The JT will attempt to schedule work where the data is. If you're using 3x replication, there are three nodes where each block resides, so you can calculate the odds of getting an open slot to process your data local to its location.

However, if there is an open slot where the data is not located, you will still process the data in that open slot. You lose data locality, and that smaller chunk of data will be processed on that node. So in that case, yes, the data is shipped to the node. If you look at your JobTracker web page for the results of your processing, you will see what percentage of the work was data-local. Hadoop is pretty good in that respect.
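
The same breakdown is available programmatically. A minimal sketch, assuming
a job driven through the new (org.apache.hadoop.mapreduce) API and a Hadoop
version that exposes the JobCounter enum:

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;

    public class LocalityReport {
        // Call on a Job after job.waitForCompletion(true) has returned.
        static void print(Job job) throws IOException {
            Counters c = job.getCounters();
            long launched  = c.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
            long dataLocal = c.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
            long rackLocal = c.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
            System.out.printf("maps launched: %d, data-local: %d, rack-local: %d%n",
                    launched, dataLocal, rackLocal);
        }
    }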


NOTE THE FOLLOWING...
If you know that the processing time is a couple of orders of magnitude longer than the time it takes to ship the data to a node, you can override the normal behavior and force the processing to be done remotely. (We've done this, and there is a paper on it on InfoQ.) [We were bored and didn't like the fact that our Ganglia maps were not all red. We are evil in that way ;-) ] We really don't recommend doing this unless you are either insane or really know what you are doing.
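
Mike doesn't say how they did it, but one way to approximate the effect,
sketched here against the new-API TextInputFormat (the class name is made
up), is to hide block locations so the scheduler has no locality preference
at all:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Hypothetical: every split reports an empty host list, so no map task
    // looks local to any node and placement is effectively locality-blind.
    public class NoLocalityTextInputFormat extends TextInputFormat {
        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            List<InputSplit> blind = new ArrayList<InputSplit>();
            for (InputSplit split : super.getSplits(job)) {
                FileSplit f = (FileSplit) split;
                // Same byte range as the real split, but no replica hosts.
                blind.add(new FileSplit(f.getPath(), f.getStart(), f.getLength(),
                        new String[0]));
            }
            return blind;
        }
    }

Wire it in with job.setInputFormatClass(NoLocalityTextInputFormat.class). As
Mike says, this only pays off when compute time dwarfs transfer time.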

HTH

-Mike

On Nov 8, 2012, at 8:49 AM, Jay Vyas <ja...@gmail.com> wrote:

> Hmm this is interesting.  I think that: 
> 
> 1) For the map phase, Hadoop is smart enough to try to run mappers locally, but I think you could force these DNs to participate actively in a map job by decreasing the size of input splits. That would allow for many more mappers, some of which would be forced to run on blocks that are not local; in this scenario, those DNs don't yet have any local files that would be used for the input.
> 
> 2) For the reduce phase, since the reducers will of course be copying mapper outputs from all over the cluster, one would expect that your DataNodes would naturally take part in this portion of the job if the number of reducers (mapred.reduce.tasks) was specified.
> 
> 
> On Thu, Nov 8, 2012 at 9:35 AM, Kartashov, Andy <An...@mpac.ca> wrote:
> Hadoopers,
> 
> “Hadoop ships the code to the data instead of sending the data to the code.”
> 
> Say you added two DNs/TTs to the cluster. They have no data at this point, i.e. you have not run the balancer.
> 
> Given the statement quoted above, will these two nodes not participate in a MapReduce job until you have balanced some data onto them? Please elaborate.
> 
>  
> Rgds,
> 
> AK47
> 
> 
> 
> 
> -- 
> Jay Vyas
> http://jayunit100.blogspot.com


Re: Hadoop processing

Posted by Jay Vyas <ja...@gmail.com>.
Hmm this is interesting.  I think that:

1) For the map phase, Hadoop is smart enough to try to run mappers
locally, but I think you could force these DNs to participate actively in a
map job by decreasing the size of input splits. That would allow for many
more mappers, some of which would be forced to run on blocks that are not
local; in this scenario, those DNs don't yet have any local files that
would be used for the input.
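
A rough sketch of that knob, assuming a Hadoop 1.x-era new-API driver (the
job name and 16 MB figure are just illustrations):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SmallSplitsDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "small-splits-demo");  // 1.x-era constructor
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // Cap each input split at 16 MB: a 64 MB block now yields ~4 map
            // tasks instead of 1, so there can be more tasks than data-local
            // slots and some maps must be scheduled off the hosting nodes.
            FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);
        }
    }

The same cap can also be set as a configuration property, though its name
varies across Hadoop versions.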

2) For the reduce phase, since the reducers will of course be copying
mapper outputs from all over the cluster, one would expect that your
DataNodes would naturally take part in this portion of the job if the
number of reducers (mapred.reduce.tasks) was specified.
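
Concretely, that just means requesting a nonzero reducer count in the
driver; reducer placement doesn't depend on HDFS data locality, so fresh
nodes pick up shuffle work right away. A minimal sketch:

    // Driver fragment: ask for 8 reduce tasks on an existing Job object.
    // Equivalent to setting mapred.reduce.tasks=8 in the job configuration.
    job.setNumReduceTasks(8);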


On Thu, Nov 8, 2012 at 9:35 AM, Kartashov, Andy <An...@mpac.ca> wrote:

>  Hadoopers,
>
> “Hadoop ships the code to the data instead of sending the data to the
> code.”
>
> Say you added two DNs/TTs to the cluster. They have no data at this point,
> i.e. you have not run the balancer.
>
> Given the statement quoted above, will these two nodes not participate in
> a MapReduce job until you have balanced some data onto them? Please
> elaborate.
>
>
>
> Rgds,
>
> AK47
>



-- 
Jay Vyas
http://jayunit100.blogspot.com
