Posted to common-dev@hadoop.apache.org by "Enis Soztutar (JIRA)" <ji...@apache.org> on 2007/06/21 15:25:49 UTC

[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HADOOP-1515:
----------------------------------

    Attachment: multiFile_v1.0.patch

{{multiFile_v1.0.patch}}

This patch implements two classes: MultiFileSplit and MultiFileInputFormat. Below are the javadocs:

{code}
/**
 * A sub-collection of input files. Unlike FileSplit, MultiFileSplit 
 * does not represent a split of a file, but a split of the input files 
 * into smaller sets. The atomic unit of a split is a file.  
 * MultiFileSplit can be used to implement RecordReaders that 
 * read one record per file.
 */
public class MultiFileSplit implements InputSplit
{code}

and 

{code}
/**
 * An abstract InputFormat that returns MultiFileSplits
 * from its #getSplits(JobConf, int) method. Splits are constructed from 
 * the files under the input paths. Each split returned contains nearly
 * equal content length. 
 * Subclasses implement #getRecordReader(InputSplit, JobConf, Reporter)
 * to construct RecordReaders for MultiFileSplits.
 */
public abstract class MultiFileInputFormat extends FileInputFormat
{code}
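
To illustrate what "nearly equal content length" with whole files as the atomic unit can mean, here is a minimal sketch in plain Java with no Hadoop dependencies. It is not taken from the patch: the class name SplitSketch, the method partition, and the contiguous grouping strategy are assumptions made for illustration only. It closes a group each time the running total of file lengths reaches the next multiple of the average group size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code in multiFile_v1.0.patch.
final class SplitSketch {

    /**
     * Groups file indices into at most numSplits contiguous groups whose
     * total lengths are close to total / numSplits. A file is never divided.
     */
    static List<List<Integer>> partition(long[] lengths, int numSplits) {
        List<List<Integer>> splits = new ArrayList<>();
        if (lengths.length == 0 || numSplits <= 0) {
            return splits;
        }
        int n = Math.min(numSplits, lengths.length);
        long total = 0;
        for (long len : lengths) {
            total += len;
        }

        List<Integer> current = new ArrayList<>();
        long cumulative = 0;
        for (int i = 0; i < lengths.length; i++) {
            current.add(i);
            cumulative += lengths[i];
            // Integer form of: cumulative >= (total / n) * (splits.size() + 1)
            boolean reachedBoundary = cumulative * n >= total * (splits.size() + 1);
            // The last group is left open so it collects all remaining files.
            if (reachedBoundary && splits.size() < n - 1) {
                splits.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] lengths = {5, 5, 10, 20, 20};
        // prints [[0, 1, 2], [3], [4]] (each group totals 20)
        System.out.println(partition(lengths, 3));
    }
}
```

With very uneven file sizes this sketch may return fewer groups than requested, since a single large file can cover several boundaries at once.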

I have successfully tested this implementation as part of a job with more than 15k input files, one record per file, and 2GB of data. 
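
For readers unfamiliar with the one-record-per-file pattern, the following is a rough sketch of the idea in plain Java, again without Hadoop. It is an assumption-laden illustration and not the RecordReader from the job above: the class name WholeFileReaderSketch is hypothetical, and a real subclass would implement Hadoop's RecordReader interface over the files of a MultiFileSplit rather than java.util.Iterator.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only: yields the full contents of each file
// in a group of files as a single record.
final class WholeFileReaderSketch implements Iterator<byte[]> {
    private final List<Path> files; // the files belonging to one split
    private int index = 0;          // position of the next file to read

    WholeFileReaderSketch(List<Path> files) {
        this.files = files;
    }

    @Override
    public boolean hasNext() {
        return index < files.size();
    }

    @Override
    public byte[] next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        try {
            // One record == the whole file; assumes each file fits in memory.
            byte[] record = Files.readAllBytes(files.get(index));
            index++;
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading each file in a single call only makes sense when individual files are small, which is the case when many small files make up the input.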


> MultiFileSplit, MultiFileInputFormat
> ------------------------------------
>
>                 Key: HADOOP-1515
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1515
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.14.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.14.0
>
>         Attachments: multiFile_v1.0.patch
>
>
> An {{InputSplit}} and {{InputFormat}} implementation for jobs that need to read records from many files. The input is partitioned by file. This can be used, for example, to implement {{RecordReader}}s that read one record per file. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.