Posted to common-user@hadoop.apache.org by Julian Bui <ju...@gmail.com> on 2013/03/05 12:49:24 UTC

basic question about rack awareness and computation migration

Hi hadoop users,

I'm trying to find out if computation migration is something the developer
needs to worry about or if it's supposed to be hidden.

I would like to use Hadoop to take in a list of image paths in HDFS and
then have each task compress these large, raw images into something much
smaller - say, JPEG files.

Input: list of paths
Output: compressed jpeg

Since I don't really need a reduce task (I'm more using hadoop for its
reliability and orchestration aspects), my mapper ought to just take the
list of image paths and then work on them.  As I understand it, each image
will likely be on multiple data nodes.

My question is: how will each mapper task "migrate the computation" to the
data nodes?  I recall reading that the namenode is supposed to deal with
this.  Is it hidden from the developer?  Or as the developer, do I need to
discover where the data lies and then migrate the task to that node?  Since
my input is just a list of paths, it seems like the namenode couldn't
really do this for me.

Another question: Where can I find out more about this?  I've looked up
"rack awareness" and "computation migration" but haven't really found much
code relating to either one - leading me to believe I'm not supposed to
have to write code to deal with this.

Anyway, could someone please help me out or set me straight on this?

Thanks,
-Julian

Re: basic question about rack awareness and computation migration

Posted by Julian Bui <ju...@gmail.com>.
Hi Rohit,

Thanks for responding.

> Hadoop can schedule a task to execute on the same node that holds the
data.

In my case, the mapper won't actually know where the data resides at the
time of being scheduled.  It only knows what data it will be accessing when
it reads in the keys.  In other words, the task will already be running
by the time the mapper figures out what data must be accessed - so how can
hadoop know where to execute the code?

I'm still lost.  Please help if you can.

-Julian

On Tue, Mar 5, 2013 at 11:15 AM, Rohit Kochar <mn...@gmail.com> wrote:

> Hello,
> To be precise, this is hidden from the developer and you need not write
> any code for this.
> Whenever a file is stored in HDFS, it is split into blocks of the
> configured size, and each block could potentially be stored on a
> different datanode. All the information about which blocks make up which
> file resides with the namenode.
>
> So essentially, whenever a file is accessed via the DFS client, the
> client requests the metadata from the NameNode and uses it to stream the
> file to the end user.
>
> Since the namenode knows the location of all the blocks/files, Hadoop
> can schedule a task to execute on the same node that holds the data.
>
> Thanks
> Rohit Kochar
>
> On 05-Mar-2013, at 5:19 PM, Julian Bui wrote:
>
> > Hi hadoop users,
> >
> > I'm trying to find out if computation migration is something the
> developer needs to worry about or if it's supposed to be hidden.
> >
> > I would like to use hadoop to take in a list of image paths in the hdfs
> and then have each task compress these large, raw images into something
> much smaller - say jpeg  files.
> >
> > Input: list of paths
> > Output: compressed jpeg
> >
> > Since I don't really need a reduce task (I'm more using hadoop for its
> reliability and orchestration aspects), my mapper ought to just take the
> list of image paths and then work on them.  As I understand it, each image
> will likely be on multiple data nodes.
> >
> > My question is how will each mapper task "migrate the computation" to
> the data nodes?  I recall reading that the namenode is supposed to deal
> with this.  Is it hidden from the developer?  Or as the developer, do I
> need to discover where the data lies and then migrate the task to that
> node?  Since my input is just a list of paths, it seems like the namenode
> couldn't really do this for me.
> >
> > Another question: Where can I find out more about this?  I've looked up
> "rack awareness" and "computation migration" but haven't really found much
> code relating to either one - leading me to believe I'm not supposed to
> have to write code to deal with this.
> >
> > Anyway, could someone please help me out or set me straight on this?
> >
> > Thanks,
> > -Julian
>
>

Re: basic question about rack awareness and computation migration

Posted by Rohit Kochar <mn...@gmail.com>.
Hello,
To be precise, this is hidden from the developer and you need not write any code for this.
Whenever a file is stored in HDFS, it is split into blocks of the configured size, and each block could potentially be stored on a different datanode. All the information about which blocks make up which file resides with the namenode.

So essentially, whenever a file is accessed via the DFS client, the client requests the metadata from the NameNode and uses it to stream the file to the end user.

Since the namenode knows the location of all the blocks/files, Hadoop can schedule a task to execute on the same node that holds the data.
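
For illustration, here is a minimal sketch (the class name and argument handling are made up) of what that looks like from the client side - the NameNode lookup and the per-block reads from the datanodes are all hidden behind FileSystem.open(). If you want to see the actual block-to-datanode mapping for a file, `hadoop fsck <path> -files -blocks -locations` will print it.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
  public static void main(String[] args) throws IOException {
    // Assumes fs.defaultFS in the Configuration points at the cluster;
    // args[0] is whatever HDFS file you want to read.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Behind the scenes the DFS client asks the NameNode for this file's
    // block locations and then streams each block from a datanode.
    try (FSDataInputStream in = fs.open(new Path(args[0]))) {
      byte[] buffer = new byte[64 * 1024];
      long total = 0;
      int read;
      while ((read = in.read(buffer)) != -1) {
        total += read;
      }
      System.out.println("Read " + total + " bytes");
    }
  }
}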

Thanks
Rohit Kochar

On 05-Mar-2013, at 5:19 PM, Julian Bui wrote:

> Hi hadoop users,
> 
> I'm trying to find out if computation migration is something the developer needs to worry about or if it's supposed to be hidden.
> 
> I would like to use hadoop to take in a list of image paths in the hdfs and then have each task compress these large, raw images into something much smaller - say jpeg  files.  
> 
> Input: list of paths
> Output: compressed jpeg
> 
> Since I don't really need a reduce task (I'm more using hadoop for its reliability and orchestration aspects), my mapper ought to just take the list of image paths and then work on them.  As I understand it, each image will likely be on multiple data nodes.  
> 
> My question is how will each mapper task "migrate the computation" to the data nodes?  I recall reading that the namenode is supposed to deal with this.  Is it hidden from the developer?  Or as the developer, do I need to discover where the data lies and then migrate the task to that node?  Since my input is just a list of paths, it seems like the namenode couldn't really do this for me.
> 
> Another question: Where can I find out more about this?  I've looked up "rack awareness" and "computation migration" but haven't really found much code relating to either one - leading me to believe I'm not supposed to have to write code to deal with this.
> 
> Anyway, could someone please help me out or set me straight on this?
> 
> Thanks,
> -Julian


Re: basic question about rack awareness and computation migration

Posted by Shumin Guo <gs...@gmail.com>.
Yes, I agree with Bertrand. Hadoop can take a whole file as input; you
just put your compression code into the map method and use the identity
reduce function, which simply writes your compressed data onto HDFS using
the file output format.
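
For example, a mapper along these lines could work. This is only a sketch: the class name and the compressToJpeg helper are invented, and it assumes a whole-file input format (like the one sketched later in the thread) that hands each map() call one complete image as a BytesWritable.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// One whole raw image arrives per map() call; emit (file name, JPEG bytes).
public class CompressImageMapper
    extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {

  @Override
  protected void map(NullWritable key, BytesWritable value, Context context)
      throws IOException, InterruptedException {
    // The split tells us which file this image came from.
    String fileName =
        ((FileSplit) context.getInputSplit()).getPath().getName();

    byte[] raw = Arrays.copyOf(value.getBytes(), value.getLength());
    byte[] jpeg = compressToJpeg(raw);  // placeholder for your image library

    context.write(new Text(fileName), new BytesWritable(jpeg));
  }

  private byte[] compressToJpeg(byte[] raw) {
    // Stub so the sketch compiles; a real job would call e.g. javax.imageio.
    return raw;
  }
}

Since there is no real reduce work, an alternative to the identity reducer is job.setNumReduceTasks(0) in the driver, which makes this a map-only job and skips the shuffle entirely.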

Thanks,

On Thu, Mar 7, 2013 at 7:35 AM, Bertrand Dechoux <de...@gmail.com> wrote:

> I might have missed something but is there a reason for the input of the
> mappers to be a list of files and not the files themselves?
> The usual way is to provide a path to the files that should be processed
> and then Hadoop will figure for you how to best use data locality.
> Is there a reason for not doing that?
>
> How big is each image file? How are they stored?
>
> You could create an input format not splittable (it is a simple property),
> that way you are sure that a mapper will process the whole file.
> And then trivially your mapper compresses the provided image, Hadoop will
> use a mapper per file and deals with data locality by itself.
>
> Regards
>
> Bertrand
>
>
> On Wed, Mar 6, 2013 at 4:43 AM, Julian Bui <ju...@gmail.com> wrote:
>
>> Thanks Harsh,
>>
>> > Are your input lists big (for each compressed output)? And is the list
>> arbitrary or a defined list per goal?
>>
>> I dictate what my inputs will look like.  If they need to be list of
>> image files, then I can do that.  If they need to be the images themselves
>> as you suggest, then I can do that too but I'm not exactly sure what that
>> would look like.  Basically, I will try to format my inputs in the way that
>> makes the most sense from a locality point of view.
>>
>> Since all the keys must be writable, I explored the Writable interface
>> and found the interesting sub-classes:
>>
>>    - FileSplit
>>    - BlockLocation
>>    - BytesWritable
>>
>> These all look somewhat promising as they kind of reveal the location
>> information of the files.
>>
>> I'm not exactly sure how I would use these to hint at the data locations.
>>  Since these chunks of the file appear to be somewhat arbitrary in size and
>> offset, I don't know how I could perform imagery operations on them.  For
>> example, if I knew that bytes 0x100-0x400 lie on node X, then that makes it
>> difficult for me to use that information to give to my image libraries -
>> does 0x100-0x400 correspond to some region/MBR within the image?  I'm not
>> sure how to make use of this information.
>>
>> The responses I've gotten so far indicate to me that HDFS kind of does
>> the computation migration for me but that I have to give it enough
>> information to work with.  If someone could point to some detailed reading
>> about this subject that would be pretty helpful, as I just can't find the
>> documentation for it.
>>
>> Thanks again,
>> -Julian
>>
>> On Tue, Mar 5, 2013 at 5:39 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>>> Your concern is correct: If your input is a list of files, rather than
>>> the files themselves, then the tasks would not be data-local - since
>>> the task input would just be the list of files, and the files' data
>>> may reside on any node/rack of the cluster.
>>>
>>> However, your job will still run as the HDFS reads do remote reads
>>> transparently without developer intervention and all will still work
>>> as you've written it to. If a block is found local to the DN, it is
>>> read locally as well - all of this is automatic.
>>>
>>> Are your input lists big (for each compressed output)? And is the list
>>> arbitrary or a defined list per goal?
>>>
>>> On Tue, Mar 5, 2013 at 5:19 PM, Julian Bui <ju...@gmail.com> wrote:
>>> > Hi hadoop users,
>>> >
>>> > I'm trying to find out if computation migration is something the
>>> developer
>>> > needs to worry about or if it's supposed to be hidden.
>>> >
>>> > I would like to use hadoop to take in a list of image paths in the
>>> hdfs and
>>> > then have each task compress these large, raw images into something
>>> much
>>> > smaller - say jpeg  files.
>>> >
>>> > Input: list of paths
>>> > Output: compressed jpeg
>>> >
>>> > Since I don't really need a reduce task (I'm more using hadoop for its
>>> > reliability and orchestration aspects), my mapper ought to just take
>>> the
>>> > list of image paths and then work on them.  As I understand it, each
>>> image
>>> > will likely be on multiple data nodes.
>>> >
>>> > My question is how will each mapper task "migrate the computation" to
>>> the
>>> > data nodes?  I recall reading that the namenode is supposed to deal
>>> with
>>> > this.  Is it hidden from the developer?  Or as the developer, do I
>>> need to
>>> > discover where the data lies and then migrate the task to that node?
>>>  Since
>>> > my input is just a list of paths, it seems like the namenode couldn't
>>> really
>>> > do this for me.
>>> >
>>> > Another question: Where can I find out more about this?  I've looked up
>>> > "rack awareness" and "computation migration" but haven't really found
>>> much
>>> > code relating to either one - leading me to believe I'm not supposed
>>> to have
>>> > to write code to deal with this.
>>> >
>>> > Anyway, could someone please help me out or set me straight on this?
>>> >
>>> > Thanks,
>>> > -Julian
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>

Re: basic question about rack awareness and computation migration

Posted by Bertrand Dechoux <de...@gmail.com>.
I might have missed something but is there a reason for the input of the
mappers to be a list of files and not the files themselves?
The usual way is to provide a path to the files that should be processed,
and then Hadoop will figure out for you how best to use data locality.
Is there a reason for not doing that?

How big is each image file? How are they stored?

You could create an input format that is not splittable (it is a simple
property); that way you are sure that a mapper will process the whole file.
Then your mapper trivially compresses the provided image; Hadoop will use
one mapper per file and deal with data locality by itself.
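
For what it's worth, here is a minimal sketch of such a non-splittable, whole-file input format. The class names are mine, and it assumes each image comfortably fits in a task's memory.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Whole-file input format: never split, one record per file.
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // the "simple property": one mapper gets the whole file
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }
}

class WholeFileRecordReader
    extends RecordReader<NullWritable, BytesWritable> {

  private FileSplit split;
  private Configuration conf;
  private final BytesWritable value = new BytesWritable();
  private boolean processed = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    this.split = (FileSplit) split;
    this.conf = context.getConfiguration();
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (processed) {
      return false;
    }
    // Read the entire file into the value; fine for images that fit in memory.
    byte[] contents = new byte[(int) split.getLength()];
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(file);
      IOUtils.readFully(in, contents, 0, contents.length);
      value.set(contents, 0, contents.length);
    } finally {
      IOUtils.closeStream(in);
    }
    processed = true;
    return true;
  }

  @Override
  public NullWritable getCurrentKey() { return NullWritable.get(); }

  @Override
  public BytesWritable getCurrentValue() { return value; }

  @Override
  public float getProgress() { return processed ? 1.0f : 0.0f; }

  @Override
  public void close() { }
}

Because isSplitable() returns false, FileInputFormat creates exactly one split per file, and the hosts it attaches to that split come from the file's block locations - which is what gives the scheduler its locality hint.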

Regards

Bertrand

On Wed, Mar 6, 2013 at 4:43 AM, Julian Bui <ju...@gmail.com> wrote:

> Thanks Harsh,
>
> > Are your input lists big (for each compressed output)? And is the list
> arbitrary or a defined list per goal?
>
> I dictate what my inputs will look like.  If they need to be list of image
> files, then I can do that.  If they need to be the images themselves as you
> suggest, then I can do that too but I'm not exactly sure what that would
> look like.  Basically, I will try to format my inputs in the way that makes
> the most sense from a locality point of view.
>
> Since all the keys must be writable, I explored the Writable interface and
> found the interesting sub-classes:
>
>    - FileSplit
>    - BlockLocation
>    - BytesWritable
>
> These all look somewhat promising as they kind of reveal the location
> information of the files.
>
> I'm not exactly sure how I would use these to hint at the data locations.
>  Since these chunks of the file appear to be somewhat arbitrary in size and
> offset, I don't know how I could perform imagery operations on them.  For
> example, if I knew that bytes 0x100-0x400 lie on node X, then that makes it
> difficult for me to use that information to give to my image libraries -
> does 0x100-0x400 correspond to some region/MBR within the image?  I'm not
> sure how to make use of this information.
>
> The responses I've gotten so far indicate to me that HDFS kind of does the
> computation migration for me but that I have to give it enough information
> to work with.  If someone could point to some detailed reading about this
> subject that would be pretty helpful, as I just can't find the
> documentation for it.
>
> Thanks again,
> -Julian
>
> On Tue, Mar 5, 2013 at 5:39 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Your concern is correct: If your input is a list of files, rather than
>> the files themselves, then the tasks would not be data-local - since
>> the task input would just be the list of files, and the files' data
>> may reside on any node/rack of the cluster.
>>
>> However, your job will still run as the HDFS reads do remote reads
>> transparently without developer intervention and all will still work
>> as you've written it to. If a block is found local to the DN, it is
>> read locally as well - all of this is automatic.
>>
>> Are your input lists big (for each compressed output)? And is the list
>> arbitrary or a defined list per goal?
>>
>> On Tue, Mar 5, 2013 at 5:19 PM, Julian Bui <ju...@gmail.com> wrote:
>> > Hi hadoop users,
>> >
>> > I'm trying to find out if computation migration is something the
>> developer
>> > needs to worry about or if it's supposed to be hidden.
>> >
>> > I would like to use hadoop to take in a list of image paths in the hdfs
>> and
>> > then have each task compress these large, raw images into something much
>> > smaller - say jpeg  files.
>> >
>> > Input: list of paths
>> > Output: compressed jpeg
>> >
>> > Since I don't really need a reduce task (I'm more using hadoop for its
>> > reliability and orchestration aspects), my mapper ought to just take the
>> > list of image paths and then work on them.  As I understand it, each
>> image
>> > will likely be on multiple data nodes.
>> >
>> > My question is how will each mapper task "migrate the computation" to
>> the
>> > data nodes?  I recall reading that the namenode is supposed to deal with
>> > this.  Is it hidden from the developer?  Or as the developer, do I need
>> to
>> > discover where the data lies and then migrate the task to that node?
>>  Since
>> > my input is just a list of paths, it seems like the namenode couldn't
>> really
>> > do this for me.
>> >
>> > Another question: Where can I find out more about this?  I've looked up
>> > "rack awareness" and "computation migration" but haven't really found
>> much
>> > code relating to either one - leading me to believe I'm not supposed to
>> have
>> > to write code to deal with this.
>> >
>> > Anyway, could someone please help me out or set me straight on this?
>> >
>> > Thanks,
>> > -Julian
>>
>>
>>
>> --
>> Harsh J
>>
>
>

Re: basic question about rack awareness and computation migration

Posted by Julian Bui <ju...@gmail.com>.
Thanks Harsh,

> Are your input lists big (for each compressed output)? And is the list
arbitrary or a defined list per goal?

I dictate what my inputs will look like.  If they need to be a list of image
files, then I can do that.  If they need to be the images themselves, as you
suggest, then I can do that too, but I'm not exactly sure what that would
look like.  Basically, I will try to format my inputs in the way that makes
the most sense from a locality point of view.

Since all the keys must be writable, I explored the Writable interface and
found these interesting sub-classes:

   - FileSplit
   - BlockLocation
   - BytesWritable

These all look somewhat promising as they kind of reveal the location
information of the files.

I'm not exactly sure how I would use these to hint at the data locations.
Since these chunks of the file appear to be somewhat arbitrary in size and
offset, I don't know how I could perform imagery operations on them.  For
example, if I knew that bytes 0x100-0x400 lie on node X, it isn't clear how
I could pass that information to my image libraries - does 0x100-0x400
correspond to some region/MBR within the image?  I'm not sure how to make
use of this information.
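
To make that concrete, the location information itself is easy to get at
through the FileSystem API - a rough sketch (the path is just a
placeholder):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/images/raw/img_0001.raw");  // placeholder path
    FileSystem fs = path.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(path);
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // Each entry is an HDFS block (a byte range), not an image region.
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
  }
}

But as the replies point out, those offsets are HDFS block boundaries, not
image regions, so this is really only informational; the framework is what
acts on it.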

The responses I've gotten so far indicate to me that the framework kind of
does the computation migration for me (using the block locations that HDFS
tracks), but that I have to give it enough information to work with.  If
someone could point me to some detailed reading about this subject, that
would be pretty helpful, as I just can't find the documentation for it.

Thanks again,
-Julian

On Tue, Mar 5, 2013 at 5:39 PM, Harsh J <ha...@cloudera.com> wrote:

> Your concern is correct: If your input is a list of files, rather than
> the files themselves, then the tasks would not be data-local - since
> the task input would just be the list of files, and the files' data
> may reside on any node/rack of the cluster.
>
> However, your job will still run as the HDFS reads do remote reads
> transparently without developer intervention and all will still work
> as you've written it to. If a block is found local to the DN, it is
> read locally as well - all of this is automatic.
>
> Are your input lists big (for each compressed output)? And is the list
> arbitrary or a defined list per goal?
>
> On Tue, Mar 5, 2013 at 5:19 PM, Julian Bui <ju...@gmail.com> wrote:
> > Hi hadoop users,
> >
> > I'm trying to find out if computation migration is something the
> developer
> > needs to worry about or if it's supposed to be hidden.
> >
> > I would like to use hadoop to take in a list of image paths in the hdfs
> and
> > then have each task compress these large, raw images into something much
> > smaller - say jpeg  files.
> >
> > Input: list of paths
> > Output: compressed jpeg
> >
> > Since I don't really need a reduce task (I'm more using hadoop for its
> > reliability and orchestration aspects), my mapper ought to just take the
> > list of image paths and then work on them.  As I understand it, each
> image
> > will likely be on multiple data nodes.
> >
> > My question is how will each mapper task "migrate the computation" to the
> > data nodes?  I recall reading that the namenode is supposed to deal with
> > this.  Is it hidden from the developer?  Or as the developer, do I need
> to
> > discover where the data lies and then migrate the task to that node?
>  Since
> > my input is just a list of paths, it seems like the namenode couldn't
> really
> > do this for me.
> >
> > Another question: Where can I find out more about this?  I've looked up
> > "rack awareness" and "computation migration" but haven't really found
> much
> > code relating to either one - leading me to believe I'm not supposed to
> have
> > to write code to deal with this.
> >
> > Anyway, could someone please help me out or set me straight on this?
> >
> > Thanks,
> > -Julian
>
>
>
> --
> Harsh J
>

Re: basic question about rack awareness and computation migration

Posted by Harsh J <ha...@cloudera.com>.
Your concern is correct: If your input is a list of files, rather than
the files themselves, then the tasks would not be data-local - since
the task input would just be the list of files, and the files' data
may reside on any node/rack of the cluster.

However, your job will still run: HDFS performs remote reads
transparently, without developer intervention, and everything will still
work as you've written it.  If a block is found local to the DN, it is
read locally as well - all of this is automatic.
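
If you do want the tasks to be data-local, point the job at the image files
themselves and let the framework compute the splits.  A minimal driver
sketch - the paths, the mapper and the WholeImageInputFormat from the
earlier sketch are all illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressImagesDriver {

  // Map-only task: real image compression would go where noted.
  public static class CompressImageMapper
      extends Mapper<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(Text path, BytesWritable imageBytes, Context context)
        throws java.io.IOException, InterruptedException {
      // ... compress imageBytes with your image library here ...
      context.write(path, imageBytes);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Job.getInstance is the Hadoop 2 form; new Job(conf, ...) on older releases.
    Job job = Job.getInstance(conf, "compress images");
    job.setJarByClass(CompressImagesDriver.class);

    // The input is the directory of raw images, not a text file listing them,
    // so each generated split carries the hosts that hold its blocks.
    FileInputFormat.setInputPaths(job, new Path("/images/raw"));
    FileOutputFormat.setOutputPath(job, new Path("/images/jpeg"));

    job.setInputFormatClass(WholeImageInputFormat.class);
    job.setMapperClass(CompressImageMapper.class);
    job.setNumReduceTasks(0);  // map-only job
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}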

Are your input lists big (for each compressed output)? And is the list
arbitrary or a defined list per goal?

On Tue, Mar 5, 2013 at 5:19 PM, Julian Bui <ju...@gmail.com> wrote:
> Hi hadoop users,
>
> I'm trying to find out if computation migration is something the developer
> needs to worry about or if it's supposed to be hidden.
>
> I would like to use hadoop to take in a list of image paths in the hdfs and
> then have each task compress these large, raw images into something much
> smaller - say jpeg  files.
>
> Input: list of paths
> Output: compressed jpeg
>
> Since I don't really need a reduce task (I'm more using hadoop for its
> reliability and orchestration aspects), my mapper ought to just take the
> list of image paths and then work on them.  As I understand it, each image
> will likely be on multiple data nodes.
>
> My question is how will each mapper task "migrate the computation" to the
> data nodes?  I recall reading that the namenode is supposed to deal with
> this.  Is it hidden from the developer?  Or as the developer, do I need to
> discover where the data lies and then migrate the task to that node?  Since
> my input is just a list of paths, it seems like the namenode couldn't really
> do this for me.
>
> Another question: Where can I find out more about this?  I've looked up
> "rack awareness" and "computation migration" but haven't really found much
> code relating to either one - leading me to believe I'm not supposed to have
> to write code to deal with this.
>
> Anyway, could someone please help me out or set me straight on this?
>
> Thanks,
> -Julian



--
Harsh J

Re: basic question about rack awareness and computation migration

Posted by Rohit Kochar <mn...@gmail.com>.
Hello ,
To be precise, this is hidden from the developer and you need not write any code for it.
Whenever a file is stored in HDFS it is split into blocks of the configured size, and each block could potentially be stored on a different datanode.  All the information about which blocks make up which file resides with the namenode.

So essentially, whenever a file is accessed via the DFS client, the client requests the metadata from the NameNode,
which it then uses to provide the file to the end user in streaming fashion.

Since the namenode knows the location of all the blocks/files, a task can be scheduled by hadoop to be executed on the same node that holds the data.
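
For illustration, the client side of a read never mentions blocks or
datanodes at all - the metadata lookup and the streaming happen inside the
FileSystem/DFS client (the path below is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/images/raw/img_0001.raw");  // placeholder path
    FileSystem fs = path.getFileSystem(conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(path);  // NameNode is consulted for block metadata here
      IOUtils.copyBytes(in, System.out, 4096, false);  // streamed from datanodes
    } finally {
      IOUtils.closeStream(in);
    }
  }
}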

Thanks
Rohit Kochar

On 05-Mar-2013, at 5:19 PM, Julian Bui wrote:

> Hi hadoop users,
> 
> I'm trying to find out if computation migration is something the developer needs to worry about or if it's supposed to be hidden.
> 
> I would like to use hadoop to take in a list of image paths in the hdfs and then have each task compress these large, raw images into something much smaller - say jpeg  files.  
> 
> Input: list of paths
> Output: compressed jpeg
> 
> Since I don't really need a reduce task (I'm more using hadoop for its reliability and orchestration aspects), my mapper ought to just take the list of image paths and then work on them.  As I understand it, each image will likely be on multiple data nodes.  
> 
> My question is how will each mapper task "migrate the computation" to the data nodes?  I recall reading that the namenode is supposed to deal with this.  Is it hidden from the developer?  Or as the developer, do I need to discover where the data lies and then migrate the task to that node?  Since my input is just a list of paths, it seems like the namenode couldn't really do this for me.
> 
> Another question: Where can I find out more about this?  I've looked up "rack awareness" and "computation migration" but haven't really found much code relating to either one - leading me to believe I'm not supposed to have to write code to deal with this.
> 
> Anyway, could someone please help me out or set me straight on this?
> 
> Thanks,
> -Julian

