Posted to common-dev@hadoop.apache.org by "Enis Soztutar (JIRA)" <ji...@apache.org> on 2007/06/21 15:19:25 UTC

[jira] Created: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

MultiFileSplit, MultiFileInputFormat
------------------------------------

                 Key: HADOOP-1515
                 URL: https://issues.apache.org/jira/browse/HADOOP-1515
             Project: Hadoop
          Issue Type: New Feature
          Components: mapred
    Affects Versions: 0.14.0
            Reporter: Enis Soztutar
            Assignee: Enis Soztutar
             Fix For: 0.14.0


An {{InputSplit}} and {{InputFormat}} implementation for jobs that need to read records from many files. The input is partitioned at file granularity, so a file is never divided between splits. This can be used, for example, to implement {{RecordReader}}s that read one record per file.
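
To illustrate the "one record per file" idea, here is a minimal standalone sketch. It uses only the Java standard library rather than the Hadoop {{RecordReader}} interface, and it is not the code from the patch; the class name WholeFileReaderSketch and its next() method are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical reader that treats each file in a list as exactly one record. */
public class WholeFileReaderSketch {
    private final List<Path> files;
    private int position = 0;

    public WholeFileReaderSketch(List<Path> files) {
        this.files = files;
    }

    /** Returns the next file's full contents as one record, or null when no files remain. */
    public String next() throws IOException {
        if (position >= files.size()) {
            return null;
        }
        Path file = files.get(position++);
        // The whole file becomes a single record; it is never cut into pieces.
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Create two small temporary files to stand in for job input.
        Path dir = Files.createTempDirectory("multifile");
        Path a = Files.write(dir.resolve("a.txt"), "first record".getBytes(StandardCharsets.UTF_8));
        Path b = Files.write(dir.resolve("b.txt"), "second record".getBytes(StandardCharsets.UTF_8));

        WholeFileReaderSketch reader = new WholeFileReaderSketch(Arrays.asList(a, b));
        String record;
        while ((record = reader.next()) != null) {
            System.out.println(record);
        }
    }
}
```

A real subclass would instead return such a reader from getRecordReader and take its file list from the split it is given.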






-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507947 ] 

Hadoop QA commented on HADOOP-1515:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12360492/multiFile_v1.1.patch applied and successfully tested against trunk revision r549977.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/326/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/326/console



[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HADOOP-1515:
----------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507032 ] 

Enis Soztutar commented on HADOOP-1515:
---------------------------------------

> But can you please add a unit test? 
I'll be looking into this ASAP. 



[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HADOOP-1515:
----------------------------------

    Attachment: multiFile_v1.0.patch

{{multiFile_v1.0.patch}}

This patch implements two classes: MultiFileSplit and MultiFileInputFormat. Their javadocs are below:

{code}
/**
 * A sub-collection of input files. Unlike FileSplit, the MultiFileSplit
 * class does not represent a split of a file, but a split of the input
 * files into smaller sets. The atomic unit of a split is a file.
 * MultiFileSplit can be used to implement RecordReaders that read
 * one record per file.
 */
public class MultiFileSplit implements InputSplit
{code}

and 

{code}
/**
 * An abstract InputFormat that returns MultiFileSplits from its
 * #getSplits(JobConf, int) method. Splits are constructed from
 * the files under the input paths. Each split returned contains nearly
 * equal content length.
 * Subclasses implement #getRecordReader(InputSplit, JobConf, Reporter)
 * to construct RecordReaders for MultiFileSplits.
 */
public abstract class MultiFileInputFormat extends FileInputFormat
{code}
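
As an illustration of "nearly equal content length" with the file as the atomic unit, here is a minimal standalone sketch. It does not use the Hadoop classes and should not be read as the algorithm in the patch; the class name SplitSketch, the method partitionByLength, and the greedy contiguous grouping are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that groups whole files into splits of similar total size. */
public class SplitSketch {

    /**
     * Groups files (given by their lengths) into at most numSplits contiguous
     * groups of roughly equal total length. Each group is a list of indexes
     * into the lengths array. A file is never divided between groups.
     */
    public static List<List<Integer>> partitionByLength(long[] lengths, int numSplits) {
        long total = 0;
        for (long len : lengths) {
            total += len;
        }
        // Target size per group; at least 1 so that empty input is handled.
        long target = Math.max(1, total / Math.max(1, numSplits));

        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentLength = 0;
        for (int i = 0; i < lengths.length; i++) {
            current.add(i);
            currentLength += lengths[i];
            boolean lastGroup = groups.size() == numSplits - 1;
            // Close the group once it reaches the target, unless it is the
            // last allowed group, which takes all remaining files.
            if (currentLength >= target && !lastGroup) {
                groups.add(current);
                current = new ArrayList<>();
                currentLength = 0;
            }
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long[] lengths = {40, 10, 30, 20, 50, 50};
        // Total is 200 bytes over 3 splits, so the target is 66 bytes per split.
        for (List<Integer> group : partitionByLength(lengths, 3)) {
            System.out.println(group);
        }
    }
}
```

With these six example lengths the groups hold 80, 70 and 50 bytes: close to equal, but not exactly, because files are kept whole.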

I have successfully tested these implementations as part of a job with more than 15k input files, one record per file, and 2 GB of data.




[jira] Commented: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506915 ] 

Hadoop QA commented on HADOOP-1515:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12360275/multiFile_v1.0.patch applied and successfully tested against trunk revision r549284.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/317/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/317/console



[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1515:
---------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Enis.



[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1515:
---------------------------------

    Status: Open  (was: Patch Available)

This looks good to me.  But can you please add a unit test?  Something like TestSequenceFileInputFormat or TestTextFileInputFormat, that tests the public methods.  It doesn't need to run a job.  Thanks!



[jira] Commented: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508150 ] 

Hudson commented on HADOOP-1515:
--------------------------------

Integrated in Hadoop-Nightly #136 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/136/])



[jira] Work started: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HADOOP-1515 started by Enis Soztutar.



[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HADOOP-1515:
----------------------------------

    Status: Patch Available  (was: In Progress)



[jira] Updated: (HADOOP-1515) MultiFileSplit, MultiFileInputFormat

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HADOOP-1515:
----------------------------------

    Attachment: multiFile_v1.1.patch

Attaching the patch with the unit test. The test runs in approximately 30 seconds.

