Posted to common-user@hadoop.apache.org by Jimmy Wan <ji...@indeed.com> on 2008/03/11 17:38:01 UTC

Splitting compressed input from a single job to multiple map tasks

Is it possible to split compressed input from a single job across multiple  
map tasks? My current configuration has several task trackers, but the job  
I kick off results in a single map task. I'm launching these jobs in  
sequence via a shell script, so they end up going through a pipeline with  
only one concurrent map, which is suboptimal.

When I run this job on a full local Hadoop stack, it does split the file  
into multiple small task chunks.

-- 
Jimmy

Re: Splitting compressed input from a single job to multiple map tasks

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Mar 11, 2008, at 9:38 AM, Jimmy Wan wrote:

> Is it possible to split compressed input from a single job to  
> multiple map tasks?

It depends on the form of the compression. If you are using zlib (gzip)  
text file compression, then no: there is no way to start decompressing  
in the middle of the stream. If you use block-compressed sequence files,  
it will work fine. There are rumors of a bzip input format that supports  
input splitting, but bzip is very slow for many applications (although  
its compression ratio is good).
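[Editor's note: the point about gzip can be demonstrated with the JDK alone. The sketch below (class and method names are illustrative, not from the thread) gzips a small buffer, then tries to open a `GZIPInputStream` at a byte offset in the middle, as a map task handed the second half of a split would have to. Decoding from offset 0 works; decoding from the middle fails, because the gzip header and the deflate state both live at the start of the stream.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSplitDemo {

    // Compress a byte buffer with gzip and return the compressed bytes.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Attempt to decompress starting at a byte offset, the way a map
    // task assigned a mid-file split would have to. Returns true only
    // if the whole stream decodes cleanly from that offset.
    static boolean canDecodeFrom(byte[] gz, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            while (in.read() != -1) { /* drain */ }
            return true;
        } catch (IOException e) {
            // Typically java.util.zip.ZipException: "Not in GZIP format"
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gz = gzip("line one\nline two\nline three\n".getBytes());
        System.out.println("from start:  " + canDecodeFrom(gz, 0));
        System.out.println("from middle: " + canDecodeFrom(gz, gz.length / 2));
    }
}
```

Running this prints `true` for the start of the stream and `false` for the middle: there is no marker inside a gzip stream at which a reader can resynchronize, which is exactly why Hadoop hands the whole gzipped file to one map task.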

-- Owen
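[Editor's note: for readers who want to act on the advice above, a minimal configuration sketch follows. It assumes the Hadoop 0.16-era `mapred` API of this thread; `MyJob` is a placeholder class name, and the property names are the legacy configuration keys for sequence-file output compression. This is an illustrative fragment, not runnable on its own.]

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

JobConf conf = new JobConf(MyJob.class);  // MyJob is hypothetical

// Read and write sequence files instead of gzipped text.
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);

// Compress output in blocks (not per-record, not whole-stream), so
// that downstream jobs can split the file across many map tasks.
conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.type", "BLOCK");
```

With block compression, each compressed block is independently decodable, so a later job in the shell-script pipeline can assign different blocks to different map tasks.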