Posted to common-user@hadoop.apache.org by Jimmy Wan <ji...@indeed.com> on 2008/03/11 17:38:01 UTC
Splitting compressed input from a single job to multiple map tasks
Is it possible to split compressed input from a single job into multiple map
tasks? My current configuration has several task trackers, but the job I
kick off results in a single map task. I'm launching these jobs in
sequence via a shell script, so they end up going through a pipeline with
only one concurrent map task, which is suboptimal.
When I run this job on a full local Hadoop stack, it does seem to split
the file into multiple small task chunks.
--
Jimmy
Re: Splitting compressed input from a single job to multiple map tasks
Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Mar 11, 2008, at 9:38 AM, Jimmy Wan wrote:
> Is it possible to split compressed input from a single job to
> multiple map tasks?
It depends on the form of the compression. If you are using zlib
(gzip) text-file compression, then no: there is no way to start
decompressing in the middle of the stream. If you use block-compressed
sequence files, then splitting will work fine. There are rumors
of a bzip2 input format that supports input splitting, but bzip2 is
very slow for many applications (although its compression ratio is good).
-- Owen
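To make the splittability point above concrete, here is a small self-contained sketch using the Python standard library (not Hadoop itself; the variable names are illustrative). A single gzip stream must be decompressed from byte 0, so a map task cannot start at an arbitrary offset; but data compressed as independent blocks, which is the idea behind Hadoop's block-compressed sequence files, can be read starting at any block boundary.

```python
import gzip
import zlib

data = b"line one\n" * 1000

# Case 1: one monolithic gzip stream.
single = gzip.compress(data)

# Trying to start decompression at an arbitrary mid-stream offset fails,
# because the byte at that offset is not a valid gzip header and the
# decompressor has no state for the preceding bytes.
try:
    zlib.decompressobj(wbits=31).decompress(single[len(single) // 2:])
    mid_stream_readable = True
except zlib.error:
    mid_stream_readable = False  # expected: cannot resume mid-stream

# Case 2: block-style compression -- each block is an independent stream,
# so a reader that knows the block boundaries can start at any block.
half = len(data) // 2
blocks = [gzip.compress(data[:half]), gzip.compress(data[half:])]

# The second block decompresses on its own, without touching the first.
second_half = gzip.decompress(blocks[1])
assert second_half == data[half:]
assert mid_stream_readable is False
```

In Hadoop's block-compressed SequenceFile format, sync markers written between blocks play the role of the known block boundaries here: a reader can scan forward to the next sync marker and start decompressing from there, which is what makes input splitting possible.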