Posted to mapreduce-dev@hadoop.apache.org by shruti jain <sh...@gmail.com> on 2009/07/13 12:18:28 UTC

Speculative Processing in Hadoop

Hello Everyone,

I am a newbie and need some help. I saw on the Hadoop wiki that there can
be projects to improve Hadoop and map-reduce performance on the available
benchmarks (sort, etc.).

In a distributed file system environment, caching can be used. In
such systems, whenever a file access is required, the client has to
validate the content of its local cache against the server's file
system. While the server is answering this query, the client can
already execute the requested operations on the data available in
the cache. If the server responds that the client has the most
recently modified version of the file, the client can proceed with the
processing; otherwise it can roll back to a previous state and start
over with the newer version of the file. This saves processing power
and CPU cycles.

This could be applied to Hadoop as well. Say we are sorting a file
with map-reduce. The client asks the server for the modification time
of the file and, while waiting for the answer, starts execution on the
copy it has in its cache. When the server responds, the client can
check whether the cached copy is still current and proceed
accordingly.
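
Roughly, I imagine something like the sketch below (untested, written
against the HDFS FileSystem client API; the cached copy, its recorded
modification time, and the sort placeholder are all hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: start working on a locally cached copy while the
// NameNode is asked for the file's current modification time, then keep
// or discard the result depending on whether the cache is still fresh.
public class OptimisticCacheSort {

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path(args[0]);
    long cachedMtime = Long.parseLong(args[1]); // mtime recorded when the copy was cached

    // 1. start the (placeholder) sort on the cached copy in a separate thread
    Thread work = new Thread(new Runnable() {
      public void run() { sortCachedCopy(); }
    });
    work.start();

    // 2. meanwhile, ask the NameNode for the file's current modification time
    long serverMtime = fs.getFileStatus(file).getModificationTime();

    // 3. decide: keep the speculative result, or discard it and start over
    if (serverMtime == cachedMtime) {
      work.join();        // cached copy was current: keep the result
    } else {
      // stale: signal the worker to stop (it must poll interruption itself),
      // then re-read the newer version of the file and restart the sort
      work.interrupt();
    }
  }

  private static void sortCachedCopy() { /* placeholder for the real work */ }
}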

Could anyone please discuss whether this can be done in Hadoop?
Is it already implemented, or is anyone else working on the same idea?
If this is not the right place to discuss it, could you direct me to
another source of information?

Thank You.

Shruti

Re: Speculative Processing in Hadoop

Posted by Dhruba Borthakur <dh...@gmail.com>.
For most Hadoop use-cases, the size of the working set of data for a job far
exceeds the disk/memory capacity of a single machine. For this reason,
caching data does not help most Hadoop workloads. Hadoop clients also have
built-in read-ahead for sequential data access.

If you have a workload that can leverage the benefits of caching data, then
you can always implement it as a layer on top of Hadoop. You could write
something like a CacheFileSystem (along the lines of ChecksumFileSystem)
that is layered above a FileSystem client.
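
A very rough, untested skeleton of that idea could extend FilterFileSystem
the way ChecksumFileSystem does; cache population, validation and eviction
are only placeholders here:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FilterFileSystem;
import org.apache.hadoop.fs.Path;

// Skeleton only: a caching layer stacked on top of another FileSystem.
public class CacheFileSystem extends FilterFileSystem {

  private final Map<Path, byte[]> cache = new HashMap<Path, byte[]>();

  public CacheFileSystem(FileSystem underlying) {
    super(underlying);
  }

  @Override
  public FSDataInputStream open(Path f, int bufferSize) throws IOException {
    byte[] cached = cache.get(f);
    if (cached != null && isStillCurrent(f)) {
      return wrapAsStream(cached);   // serve the read from the local cache
    }
    // otherwise go to the underlying file system (and possibly fill the cache)
    return fs.open(f, bufferSize);
  }

  private boolean isStillCurrent(Path f) throws IOException {
    // e.g. compare a recorded mtime against fs.getFileStatus(f).getModificationTime()
    return false;                    // placeholder
  }

  private FSDataInputStream wrapAsStream(byte[] data) {
    // FSDataInputStream needs a seekable stream; omitted from this sketch
    throw new UnsupportedOperationException("left out of the sketch");
  }
}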

thanks,
dhruba


Re: Speculative Processing in Hadoop

Posted by Andrey Kuzmin <an...@gmail.com>.
As far as I understand, Hadoop solves the problem you're trying to
tackle differently: it moves computation close to the data (by
assigning a task to a node that holds a local copy of it), rather than
moving data to the computation, which is where this kind of speculative
execution tends to help, e.g. in network file systems.
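
For instance, the locality hints the framework schedules from are visible
through the client API; a small, purely illustrative probe:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints which hosts hold each block of a file -- the same locality
// information the framework uses when it tries to run a map task on a
// node that already stores that task's input split.
public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path input = new Path(args[0]);

    FileStatus status = fs.getFileStatus(input);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset() + " -> "
          + Arrays.toString(block.getHosts()));
    }
  }
}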

Regards,
Andrey


