Posted to common-dev@hadoop.apache.org by "Enis Soztutar (JIRA)" <ji...@apache.org> on 2007/06/21 15:19:25 UTC
[jira] Created: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat
MultiFileSplit, MultiFileInputFormat
------------------------------------
Key: HADOOP-1515
URL: https://issues.apache.org/jira/browse/HADOOP-1515
Project: Hadoop
Issue Type: New Feature
Components: mapred
Affects Versions: 0.14.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 0.14.0
An {{InputSplit}} and {{InputFormat}} implementation for jobs that need to read records from many files. The input is partitioned by file. This can be used, for example, to implement {{RecordReader}}s that read one record per file.
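To make the "one record per file" idea concrete, here is a minimal sketch in plain Java. It is not taken from the patch and does not use the Hadoop API; the class name OneRecordPerFileReader, the "fileName=contents" record shape, and the demo() helper are all hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical illustration: iterates over files, yielding one record per file. */
final class OneRecordPerFileReader implements Iterator<String> {
    private final Iterator<Path> files;

    OneRecordPerFileReader(List<Path> files) {
        this.files = files.iterator();
    }

    @Override
    public boolean hasNext() {
        return files.hasNext();
    }

    /** Returns "fileName=contents" for the next file; the whole file is one record. */
    @Override
    public String next() {
        Path file = files.next(); // throws NoSuchElementException when exhausted
        try {
            String contents = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            return file.getFileName() + "=" + contents;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes two small temp files and reads them back as two records. */
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("multifile-sketch");
            Path a = Files.write(dir.resolve("a.txt"), "alpha".getBytes(StandardCharsets.UTF_8));
            Path b = Files.write(dir.resolve("b.txt"), "beta".getBytes(StandardCharsets.UTF_8));
            List<String> records = new ArrayList<>();
            Iterator<String> reader = new OneRecordPerFileReader(Arrays.asList(a, b));
            while (reader.hasNext()) {
                records.add(reader.next());
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a.txt=alpha, b.txt=beta]
    }
}
```

A real {{RecordReader}} would presumably read through the job's file system rather than java.nio, but the shape is the same: each call to next() consumes exactly one file.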
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat
Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507947 ]
Hadoop QA commented on HADOOP-1515:
-----------------------------------
+1
http://issues.apache.org/jira/secure/attachment/12360492/multiFile_v1.1.patch applied and successfully tested against trunk revision r549977.
Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/326/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/326/console
[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat
Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enis Soztutar updated HADOOP-1515:
----------------------------------
Status: Patch Available (was: Open)
[jira] Commented: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat
Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507032 ]
Enis Soztutar commented on HADOOP-1515:
---------------------------------------
> But can you please add a unit test?
I'll be looking into this ASAP.
[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat
Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enis Soztutar updated HADOOP-1515:
----------------------------------
Attachment: multiFile_v1.0.patch
{{multiFile_v1.0.patch}}
This patch implements two classes: MultiFileSplit and MultiFileInputFormat. Their javadocs are below:
{code}
/**
* A sub-collection of input files. Unlike FileSplit, MultiFileSplit
* class does not represent a split of a file, but a split of input files
* into smaller sets. The atomic unit of split is a file.
* MultiFileSplit can be used to implement RecordReader's, with
* reading one record per file.
*/
public class MultiFileSplit implements InputSplit
{code}
and
{code}
/**
* An abstract InputFormat that returns MultiFileSplit's
* in #getSplits(JobConf, int) method. Splits are constructed from
* the files under the input paths. Each split returned contains nearly
* equal content length.
* Subclasses implement #getRecordReader(InputSplit, JobConf, Reporter)
* to construct RecordReader's for MultiFileSplit's.
*/
public abstract class MultiFileInputFormat extends FileInputFormat
{code}
I have successfully tested this implementation as part of a job with more than 15k input files, one record per file, and 2 GB of data.
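The javadoc above says that each split holds nearly equal content length while the file remains the atomic unit. The following is a minimal sketch of one way such a grouping could be computed. It is an assumption for illustration, not the algorithm in the patch: the class MultiFileSplitSketch and the method groupFiles are hypothetical names, file lengths are passed in as a plain array, and a simple greedy pass closes a group once it reaches the average size.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: groups whole files into splits of roughly equal byte size. */
final class MultiFileSplitSketch {

    /**
     * Partitions file indices 0..lengths.length-1 into at most numSplits
     * contiguous groups. A file is never divided between groups.
     */
    static List<List<Integer>> groupFiles(long[] lengths, int numSplits) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be >= 1");
        }
        long total = 0;
        for (long len : lengths) {
            total += len;
        }
        // Target bytes per split, rounded up; at least 1 so empty files still group.
        long target = Math.max(1, (total + numSplits - 1) / numSplits);

        List<List<Integer>> splits = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        for (int i = 0; i < lengths.length; i++) {
            current.add(i);
            currentBytes += lengths[i];
            // The final permitted split absorbs every remaining file.
            boolean onLastSplit = splits.size() == numSplits - 1;
            if (currentBytes >= target && !onLastSplit) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] lengths = {40, 10, 30, 20, 50, 50};
        // 200 bytes over 3 splits gives a target of 67 bytes per split.
        System.out.println(groupFiles(lengths, 3)); // prints [[0, 1, 2], [3, 4], [5]]
    }
}
```

In this example the three groups hold 80, 70, and 50 bytes. The balance is only approximate because files are never cut, which matches the "nearly equal" wording in the javadoc.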
[jira] Commented: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat
Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506915 ]
Hadoop QA commented on HADOOP-1515:
-----------------------------------
+1
http://issues.apache.org/jira/secure/attachment/12360275/multiFile_v1.0.patch applied and successfully tested against trunk revision r549284.
Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/317/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/317/console
[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doug Cutting updated HADOOP-1515:
---------------------------------
Resolution: Fixed
Status: Resolved (was: Patch Available)
I just committed this. Thanks, Enis.
[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doug Cutting updated HADOOP-1515:
---------------------------------
Status: Open (was: Patch Available)
This looks good to me. But can you please add a unit test? Something like TestSequenceFileInputFormat or TestTextFileInputFormat, that tests the public methods. It doesn't need to run a job. Thanks!
[jira] Commented: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508150 ]
Hudson commented on HADOOP-1515:
--------------------------------
Integrated in Hadoop-Nightly #136 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/136/])
[jira] Work started: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat
Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on HADOOP-1515 started by Enis Soztutar.
[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat
Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enis Soztutar updated HADOOP-1515:
----------------------------------
Status: Patch Available (was: In Progress)
[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat
Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enis Soztutar updated HADOOP-1515:
----------------------------------
Attachment: multiFile_v1.1.patch
Attaching the patch with the unit test. The test runs in approx. 30 seconds.