Posted to common-dev@hadoop.apache.org by "Enis Soztutar (JIRA)" <ji...@apache.org> on 2007/06/21 15:25:49 UTC
[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat
[ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enis Soztutar updated HADOOP-1515:
----------------------------------
Attachment: multiFile_v1.0.patch
{{multiFile_v1.0.patch}}
This patch implements two classes: MultiFileSplit and MultiFileInputFormat. Their javadocs are below:
{code}
/**
* A sub-collection of input files. Unlike FileSplit, MultiFileSplit
* class does not represent a split of a file, but a split of input files
* into smaller sets. The atomic unit of split is a file.
* MultiFileSplit can be used to implement RecordReaders that
* read one record per file.
*/
public class MultiFileSplit implements InputSplit
{code}
and
{code}
/**
* An abstract InputFormat that returns MultiFileSplits
* from its #getSplits(JobConf, int) method. Splits are constructed from
* the files under the input paths. Each returned split covers a nearly
* equal total content length.
* Subclasses implement #getRecordReader(InputSplit, JobConf, Reporter)
* to construct RecordReaders for MultiFileSplits.
*/
public abstract class MultiFileInputFormat extends FileInputFormat
{code}
I have successfully tested this implementation as part of a job with more than 15k input files, one record per file, and 2GB of data.
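To illustrate the "nearly equal content length" behaviour described in the MultiFileInputFormat javadoc, here is a minimal sketch in plain Java with no Hadoop dependency. It is one plausible way to group whole files into splits and is not necessarily what the attached patch does. The names MultiFileSplitSketch, FileGroup and groupFiles are hypothetical and exist only for this example. Files are kept in their input order and never cut, so a file stays the atomic unit of a split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code from multiFile_v1.0.patch.
final class MultiFileSplitSketch {

    /** Hypothetical stand-in for a split: indices of whole files plus their total length. */
    static final class FileGroup {
        final List<Integer> fileIndices = new ArrayList<>();
        long totalLength = 0;
    }

    /**
     * Groups files (given by their lengths, in input order) into at most
     * numSplits contiguous groups of roughly equal total length.
     * Assumes total * numSplits fits in a long.
     */
    static List<FileGroup> groupFiles(long[] lengths, int numSplits) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be >= 1");
        }
        List<FileGroup> groups = new ArrayList<>();
        if (lengths.length == 0) {
            return groups;
        }
        // A file is never cut, so there cannot be more groups than files.
        int splits = Math.min(numSplits, lengths.length);

        long total = 0;
        for (long len : lengths) {
            total += len;
        }

        long consumed = 0; // bytes assigned to groups so far
        int next = 0;      // index of the next unassigned file
        for (int g = 0; g < splits; g++) {
            boolean last = (g == splits - 1);
            // Cumulative byte boundary this group should reach.
            long target = last ? total : total * (g + 1) / splits;
            // Keep one file in reserve for every group still to come.
            int reserved = splits - 1 - g;

            FileGroup group = new FileGroup();
            do {
                group.fileIndices.add(next);
                group.totalLength += lengths[next];
                consumed += lengths[next];
                next++;
            } while (next < lengths.length - reserved && (last || consumed < target));
            groups.add(group);
        }
        return groups;
    }

    public static void main(String[] args) {
        long[] lengths = {10, 20, 30, 40, 50, 50};
        for (FileGroup group : groupFiles(lengths, 2)) {
            System.out.println(group.fileIndices + " -> " + group.totalLength + " bytes");
        }
    }
}
```

With the sample lengths in main, the first group takes files 0 to 3 and the second takes files 4 and 5, each totalling 100 bytes. Because whole files are assigned in order, a single very large file can still make one group much larger than the others.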
> MultiFileSplit, MultiFileInputFormat
> ------------------------------------
>
> Key: HADOOP-1515
> URL: https://issues.apache.org/jira/browse/HADOOP-1515
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Affects Versions: 0.14.0
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Fix For: 0.14.0
>
> Attachments: multiFile_v1.0.patch
>
>
> An {{InputSplit}} and {{InputFormat}} implementation for jobs that need to read records from many files. The input is partitioned by file. This can be used, for example, to implement {{RecordReader}}s that read one record per file.