Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2006/09/26 07:42:51 UTC

[jira] Resolved: (HADOOP-374) native support for gzipped text files

     [ http://issues.apache.org/jira/browse/HADOOP-374?page=all ]

Owen O'Malley resolved HADOOP-374.
----------------------------------

    Fix Version/s: 0.6.2
       Resolution: Fixed

> native support for gzipped text files
> -------------------------------------
>
>                 Key: HADOOP-374
>                 URL: http://issues.apache.org/jira/browse/HADOOP-374
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Yoram Arnon
>             Fix For: 0.6.2
>
>
> In many cases it is convenient to store text files in DFS as gzip-compressed files.
> It would be good to have built-in support for processing these files in a MapReduce job.
> Because a gzip stream can only be decompressed from its beginning, the getSplits
> implementation should return a single split per input file, ignoring the numSplits parameter.
> One can probably subclass InputFormatBase and have the getSplits method simply call listPaths(),
> then construct and return a single split per path returned.
> The code for reading would look something like this (courtesy of Vijay Murthy):
>    public RecordReader getRecordReader(FileSystem fs, FileSplit split,
>                                        JobConf job, Reporter reporter)
>      throws IOException {
>      final BufferedReader in =
>        new BufferedReader(new InputStreamReader
>          (new GZIPInputStream(fs.open(split.getPath()))));
>      return new RecordReader() {
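>          // Decompressed characters read so far; line terminators are not
>          // counted, and this is not an offset into the compressed file.
>          long position;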
>          public synchronized boolean next(Writable key, Writable value)
>            throws IOException {
>            String line = in.readLine();
>            if (line != null) {
>              position += line.length();
>              ((UTF8)value).set(line);
>              return true;
>            }
>            return false;
>          }
>          public synchronized long getPos() throws IOException {
>            return position;
>          }
>          public synchronized void close() throws IOException {
>            in.close();
>          }
>        };
>    }
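
For readers who want to try the idea outside Hadoop, here is a minimal
sketch in plain Java that uses only java.util.zip. It is not the code
committed for this issue. GzipLineSketch, WholeFileSplit, getSplits,
readLines and writeGzip are hypothetical names made up for illustration,
and the input is assumed to be UTF-8 text. It shows the two points from
the description: one split per file regardless of numSplits, and line
reading through a GZIPInputStream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical, Hadoop-free illustration of the approach in the description.
class GzipLineSketch {

    // Stand-in for a file split that always covers a whole file.
    static final class WholeFileSplit {
        final Path path;
        final long compressedLength;

        WholeFileSplit(Path path, long compressedLength) {
            this.path = path;
            this.compressedLength = compressedLength;
        }
    }

    // Returns exactly one split per input file. numSplits is deliberately
    // ignored, because a gzip stream has to be decoded from its first byte.
    static List<WholeFileSplit> getSplits(List<Path> inputs, int numSplits)
            throws IOException {
        List<WholeFileSplit> splits = new ArrayList<>();
        for (Path p : inputs) {
            splits.add(new WholeFileSplit(p, Files.size(p)));
        }
        return splits;
    }

    // Decompresses the split and returns its lines (assumes UTF-8 text).
    static List<String> readLines(WholeFileSplit split) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(split.path)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Helper that writes a small gzip file so the sketch can be run as-is.
    static void writeGzip(Path path, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(path)),
                StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sample", ".gz");
        writeGzip(tmp, "alpha\nbeta\n");
        // Ask for 4 splits; one split per file is returned anyway.
        for (WholeFileSplit s : getSplits(Arrays.asList(tmp), 4)) {
            for (String line : readLines(s)) {
                System.out.println(line);
            }
        }
        Files.delete(tmp);
    }
}
```

The trade-off of this design is that a large gzipped file is processed
by a single map task, so parallelism comes only from the number of
input files.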

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira