Posted to common-dev@hadoop.apache.org by "Milind Bhandarkar (JIRA)" <ji...@apache.org> on 2008/04/09 22:50:08 UTC

[jira] Created: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Need a "LineBasedTextInputFormat"
---------------------------------

                 Key: HADOOP-3221
                 URL: https://issues.apache.org/jira/browse/HADOOP-3221
             Project: Hadoop Core
          Issue Type: New Feature
          Components: mapred
    Affects Versions: 0.16.2
         Environment: All
            Reporter: Milind Bhandarkar


In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters (referred to as "parameter sweeps").

One way to achieve this is to specify a set of parameters (one set per line) as input in a control file, which becomes the input path to the map-reduce application, whereas the actual input dataset is specified via a config variable in the JobConf.

It would be great to have an InputFormat that splits the input file such that, by default, one line is fed as the value to each map task, with the line number as the key; i.e. (k, v) is (LongWritable, Text).

If the user specifies the number of maps explicitly, each mapper should get a contiguous chunk of lines (so as to load-balance between the mappers).

The location hints for the splits should not be derived from the input file, but rather should span the whole mapred cluster.

(Is there a way to do this without having to return an array of nSplits*nTaskTrackers?)

Increasing the replication of the "real" input dataset (since it will be fetched by all the nodes) is an orthogonal concern, and one can use the DistributedCache for that.

(P.S. Please choose a better name for this InputFormat. I am not in love with the "LineBasedText" name.)
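The pattern described above can be illustrated with a driver sketch against the old org.apache.hadoop.mapred API, using the NLineInputFormat that this issue eventually contributed. This is a hedged illustration, not code from the issue; the control-file name "params.txt" and the property "my.real.dataset.path" are hypothetical.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

// Parameter-sweep driver sketch: the control file (one parameter set per
// line) is the job input, while the shared "real" dataset is handed to
// every task through a config variable instead of the input path.
public class ParameterSweepDriver {
  public static JobConf configure(JobConf conf) {
    // each line of the control file becomes one map task by default
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1);
    // control file as job input (hypothetical name)
    FileInputFormat.setInputPaths(conf, new Path("params.txt"));
    // real dataset passed via a config variable (hypothetical property)
    conf.set("my.real.dataset.path", "/data/shared-input");
    return conf;
  }
}
```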


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-3221:
--------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Amareshwari!

> Need a "LineBasedTextInputFormat"
> ---------------------------------
>
>                 Key: HADOOP-3221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3221
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.16.2
>         Environment: All
>            Reporter: Milind Bhandarkar
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.18.0
>
>         Attachments: patch-3221-1.txt, patch-3221-2.txt, patch-3221.txt
>
>
> In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters (referred to as "parameter sweeps").
> One way to achieve this is to specify a set of parameters (one set per line) as input in a control file, which becomes the input path to the map-reduce application, whereas the actual input dataset is specified via a config variable in the JobConf.
> It would be great to have an InputFormat that splits the input file such that, by default, one line is fed as the value to each map task, with the line number as the key; i.e. (k, v) is (LongWritable, Text).
> If the user specifies the number of maps explicitly, each mapper should get a contiguous chunk of lines (so as to load-balance between the mappers).
> The location hints for the splits should not be derived from the input file, but rather should span the whole mapred cluster.
> (Is there a way to do this without having to return an array of nSplits*nTaskTrackers?)
> Increasing the replication of the "real" input dataset (since it will be fetched by all the nodes) is an orthogonal concern, and one can use the DistributedCache for that.
> (P.S. Please choose a better name for this InputFormat. I am not in love with the "LineBasedText" name.)



[jira] Updated: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-3221:
--------------------------------------------

    Attachment: patch-3221-2.txt

Here is a patch adding org.apache.hadoop.mapred.lib.NLineInputFormat, which makes N lines of the input file into one split.
N is specified using the config variable "mapred.line.input.format.linespermap", which defaults to 1. For files with a number of lines not evenly divisible by N, the last split constructed from that file will have fewer than N lines.

NLineInputFormat constructs FileSplits containing N lines each, and uses LineRecordReader to read the lines from the split.
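The grouping arithmetic described above (N consecutive lines per split, with a shorter final split when the line count is not a multiple of N) can be sketched in plain Java; this is an illustration only, not the actual NLineInputFormat source:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the split arithmetic only: every N consecutive
// lines become one split, and the last split may hold fewer than N lines
// when the line count is not a multiple of N. (Not the Hadoop source.)
public class NLineSplitSketch {
    public static List<List<String>> splitIntoNLineChunks(List<String> lines, int n) {
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += n) {
            int end = Math.min(start + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // 5 lines with N=2 -> three splits: [a, b], [c, d], [e]
        List<List<String>> splits =
            splitIntoNLineChunks(Arrays.asList("a", "b", "c", "d", "e"), 2);
        System.out.println(splits);
    }
}
```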



[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594465#action_12594465 ] 

Amareshwari Sriramadasu commented on HADOOP-3221:
-------------------------------------------------


Here is a design for the proposed LineBasedTextInputFormat:

We can have an NLineInputFormat that splits the input file such that N lines form one split, where N defaults to 1. N can be derived from the number of maps as total_number_of_lines/number_of_maps.

bq. The location hints for the splits should not be derived from the input file, but rather, should span the whole mapred cluster.
I think this can be done by returning an empty array from InputSplit.getLocations().

To make the split contain the actual lines themselves instead of <filename, start-offset, length>, the InputSplit.write and read methods can be overridden, and a RecordReader should be implemented to read the contents of the split.

Thoughts?
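The "derive N from the number of maps" arithmetic suggested above can be sketched as follows. This is a hypothetical helper, not part of the proposal's code; ceiling division is assumed here so that the derived N yields at most numMaps splits:

```java
// Hypothetical helper illustrating total_number_of_lines / number_of_maps;
// rounding up guarantees the derived N produces at most numMaps splits.
public class LinesPerMapSketch {
    public static int linesPerMap(long totalLines, int numMaps) {
        if (numMaps <= 0) {
            throw new IllegalArgumentException("numMaps must be positive");
        }
        return (int) ((totalLines + numMaps - 1) / numMaps); // ceiling division
    }

    public static void main(String[] args) {
        // 100 lines across 7 maps -> 15 lines per map (7 splits, last has 10)
        System.out.println(linesPerMap(100, 7));
    }
}
```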




[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "lohit vijayarenu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597024#action_12597024 ] 

lohit vijayarenu commented on HADOOP-3221:
------------------------------------------

bq. N is specified using the config variable "mapred.line.input.format.linespermap", which defaults to 1. For files with a number of lines not evenly divisible by N, the last split constructed from that file will have fewer than N lines.

+1 on this approach



[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597459#action_12597459 ] 

Hudson commented on HADOOP-3221:
--------------------------------

Integrated in Hadoop-trunk #493 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/493/])



[jira] Updated: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-3221:
--------------------------------------------

    Fix Version/s: 0.18.0
           Status: Patch Available  (was: Open)



[jira] Updated: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-3221:
--------------------------------------------

    Attachment: patch-3221-1.txt

Fixed findbugs warnings



[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Milind Bhandarkar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587354#action_12587354 ] 

Milind Bhandarkar commented on HADOOP-3221:
-------------------------------------------

Arkady,

OneLineInputFormat is not checked into Hadoop yet. I created this JIRA so that we can make the necessary modifications (e.g. location hints, adjusting to numMappers, etc.) and contribute it.




[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596219#action_12596219 ] 

Chris Douglas commented on HADOOP-3221:
---------------------------------------

bq. A pass over the input files containing the lines will tell us how many lines there are. The number of maps that the user desires will give us the number of lines per map (goalsize). The offsets in the input files can then be derived in a second pass over the input files (with the pass breaking at file boundaries just like the FileSplit case).

For applications with one map per line of text (depressingly many, particularly for prototypes and research projects), the approach this patch takes makes some sense. For a line length of 40 to 100 characters, a FileSplit, even without location information, is likely no smaller than the data it describes. Given this potential advantage, there are at least two cases in this implementation that work against that model. The first, obviously, is large files; a property defining the maximum aggregate file size is pretty much required to prevent accidents. The second is specifying the number of maps and getting splits with an even number of lines. That adds little value over the default, since in practice most inputs have fairly uniform line lengths; the estimates should be very close, so the second pass has limited value. If one wants multiple lines per map only for load balancing, then generating splits in the usual way is sufficient, unless "line number as key" is a requirement and the offset isn't enough.

The purpose of this class would be much clearer if the user were required to provide N. I think it's OK to read the lines into the splits, as long as the total size is kept low. Ideally, this would mix stripped-down FileSplits with LineSplits (line literals) based on size, but that's probably overdoing it. It's probably sufficient to add a (starting) line number to LineSplit, add safety checks for the maximum input size, and change its behavior to N lines per split, rather than the current behavior. Thoughts? I think this should satisfy the requirements and, at least to me, it clarifies and narrows where this new InputFormat may be used.



[jira] Updated: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-3221:
--------------------------------------------

    Release Note: Adds org.apache.hadoop.mapred.lib.NLineInputFormat, which makes N lines of input one split. N can be specified via the configuration property "mapred.line.input.format.linespermap", which defaults to 1.



[jira] Updated: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3221:
----------------------------------

    Status: Open  (was: Patch Available)

This implements something slightly different from the requirements as stated: it takes the input file(s) and encodes each line (or a subset of lines) as a split, rather than specifying a partition of a resource with one split per line. This has some clear advantages for the issue at hand, i.e. one map per line of text, where a vanilla FileSplit is likely as large (path + offsets + locations) as the relevant line of text, and placement avoids being misled.

That said, slurping all the input files and writing their contents into the splits may not be the best approach. The result is likely to be close to guessing even offsets into each input (without reading each file), and while there's a possible space savings if both the line length and N are small, it's close enough that the value added may not distinguish it from an InputFormat returning closely cropped FileSplits stripped of locations. The use and purpose of this new InputFormat might be clearer (though not what this patch implements) if one set a property that governs how many lines are in each split (defaulting to 1).\* Since the JobTracker has to read in all the splits (and hold them in memory for the duration of the job), limiting the size of the file the user points this at would be a good idea (via a property that, if said user felt daring or malicious, could be cast off). If you felt daring, you could even mix stripped-down FileSplits with LineSplits based on the length of each section, since the classname of each split is encoded into job.splits.

A few nits:
* This should be in o.a.h.mapred.lib, not o.a.h.mapred
* Since the map expects Text, LineSplit might as well keep Text[] rather than String[]
* It might be worthwhile to use LineRecordReader instead of InputStreamReader
* I'm fairly certain that the "line number" should not be local to the split, but should be either the line number in the original input file or an offset into that file.

\* Semantically, it's not clear how to regard files with a number of lines not evenly divisible by N; the current patch would group lines from different files into the same split, which might not be what users expect, but the particular choice is not critical as long as it's documented.
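The footnote's two possible semantics can be made concrete with a small sketch (hypothetical helpers, not Hadoop code): chunking per file restarts at each file boundary and never mixes files, while chunking globally, as the current patch is described as doing, can place lines from different files in one split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch contrasting two split semantics for "N lines per
// split" over several files: per-file chunking restarts at each file
// boundary; global chunking treats all files as one stream of lines.
public class FileBoundarySketch {
    public static List<List<String>> chunkPerFile(List<List<String>> files, int n) {
        List<List<String>> splits = new ArrayList<>();
        for (List<String> file : files) {
            for (int i = 0; i < file.size(); i += n) {
                splits.add(new ArrayList<>(file.subList(i, Math.min(i + n, file.size()))));
            }
        }
        return splits;
    }

    public static List<List<String>> chunkGlobally(List<List<String>> files, int n) {
        List<String> all = new ArrayList<>();
        for (List<String> file : files) {
            all.addAll(file);
        }
        return chunkPerFile(Arrays.asList(all), n); // one logical "file"
    }

    public static void main(String[] args) {
        List<List<String>> files = Arrays.asList(
            Arrays.asList("a", "b", "c"), Arrays.asList("d"));
        // per-file with N=2: [a, b], [c], [d]; global with N=2: [a, b], [c, d]
        System.out.println(chunkPerFile(files, 2));
        System.out.println(chunkGlobally(files, 2));
    }
}
```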



[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595694#action_12595694 ] 

Hadoop QA commented on HADOOP-3221:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12381750/patch-3221-1.txt
  against trunk revision 654315.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2439/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2439/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2439/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2439/console

This message is automatically generated.



[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Arkady Borkovsky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587350#action_12587350 ] 

Arkady Borkovsky commented on HADOOP-3221:
------------------------------------------

I think we already have it in 0.16.2 as OneLineInputFormat.



[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596007#action_12596007 ] 

Devaraj Das commented on HADOOP-3221:
-------------------------------------

I agree with Chris that the JobTracker shouldn't load the lines into memory. I think we should make this work with FileSplit (minus the locations info). A pass over the input files containing the lines will tell us how many lines there are. The number of maps that the user desires will give us the number of lines per map (goalsize). The offsets in the input files can then be derived in a second pass over the input files (with the pass breaking at file boundaries just like the FileSplit case). Would this satisfy the requirements?
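The two-pass scheme described above can be sketched in plain Java (this is an illustration over a single file's bytes, not Hadoop's actual getSplits code; the class and method names are made up):

```java
import java.util.ArrayList;
import java.util.List;

public class LineOffsetSplits {
    /** A (start, length) byte range within one file, akin to a FileSplit sans locations. */
    static final class Span {
        final long start, length;
        Span(long start, long length) { this.start = start; this.length = length; }
    }

    /**
     * Pass one counts the lines; the desired number of maps then gives the
     * lines-per-split goal; pass two records an offset span at every goal
     * boundary.
     */
    static List<Span> splits(byte[] data, int numMaps) {
        // Pass 1: count the lines.
        int lines = 0;
        for (byte b : data) {
            if (b == '\n') lines++;
        }
        if (data.length > 0 && data[data.length - 1] != '\n') {
            lines++; // final line without a trailing newline
        }
        int linesPerSplit = Math.max(1, (lines + numMaps - 1) / numMaps);

        // Pass 2: emit a (start, length) span every linesPerSplit lines.
        List<Span> result = new ArrayList<>();
        long start = 0;
        int inSplit = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n' && ++inSplit == linesPerSplit) {
                result.add(new Span(start, i + 1 - start));
                start = i + 1;
                inSplit = 0;
            }
        }
        if (start < data.length) {
            result.add(new Span(start, data.length - start));
        }
        return result;
    }
}
```

Extending this to multiple files (breaking spans at file boundaries, as the comment suggests) would just apply the second pass per file while carrying the goal size across.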



[jira] Assigned: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu reassigned HADOOP-3221:
-----------------------------------------------

    Assignee: Amareshwari Sriramadasu



[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596374#action_12596374 ] 

Devaraj Das commented on HADOOP-3221:
-------------------------------------

I'm inclined to think the FileSplit-based approach is the better one. The reasons:
1) We don't invent brand-new input formats. We reuse what exists, and the amount of new code is minimal (at a high level, it seems like only FileInputFormat.getSplits and FileSplit.getLocations need to be overridden).
2) We handle the large-file cases better. Granted, with 1 line per map we might have the same problem with FileSplit, but we could work around that by using a larger N.
3) We make no assumptions about line lengths, etc. Just make one pass over the files and arrive at the splits.

The only issue is that a couple of datanodes in the cluster might become a bottleneck for split serving. But that could be handled with a higher replication factor for such files (just as we handle job.jar, etc.).

Thoughts?



[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596517#action_12596517 ] 

Chris Douglas commented on HADOOP-3221:
---------------------------------------

bq. We don't invent brand new input formats. We reuse what exists and the amount of new code is minimal

Which is why this would reuse LineRecordReader to handle compression for the split generation, etc.

bq. We are better at handling the cases of large files. Granted that with 1 line per map, we might have the same problem with FileSplit. But we could work around that by having a larger N.

That's why this was requested. Our model handles large files, but users want to create maps initialized with a handful of parameters defined in a text file and executed at arbitrary points on the cluster. I'm skeptical of this model, but it's an idiom used often enough to justify a new InputFormat. It only makes sense when N is small (in practice, N=1 most of the time) and specified by the user, and when the file is small. The existing code covers the other cases.

bq. The only issue is that we might end up in a situation where a couple of datanodes in the cluster becomes a bottleneck for the split serving

That's not likely to be a bottleneck for these jobs. The optimization isn't just for split serving, but also potentially in the size of the split. Doing this with FileSplits sans locations will probably end up averaging 70-120 bytes per split, right? If the lines are shorter than that, embedding them in the split is a win. If it's within 10-20% of that size, it's probably still worth doing. It becomes less attractive as it converges to the cases we already cover.

bq. We don't make assumptions about the line lengths, etc. Just make one pass over the files and arrive at the splits.

Both require a pass for the line numbers, if that's a requirement.

A lot seems to hinge on this. If it is a requirement that the path be included, then there's no longer any real advantage to embedding the line with the split. If users don't need that context, then there are some potential advantages to the core approach in the current patch.



[jira] Updated: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-3221:
--------------------------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Milind Bhandarkar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596726#action_12596726 ] 

Milind Bhandarkar commented on HADOOP-3221:
-------------------------------------------

Just talked with Amareshwari. She suggested specifying number of lines per mapper as a configuration variable which defaults to 1. The name of the config variable could be: mapred.line.input.format.linespermap.

With this, the splits could be computed in a single pass over the parameter file (input file).

This is a better approach, IMHO. Since the parameter file is small, the user could easily do:

hadoop dfs -cat /path/to/param/list | wc -l

And do the necessary calculations before specifying the config variable. Or just let it default to one.
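The "necessary calculations" amount to a ceiling division of the line count by the desired number of maps. A small shell sketch (the file name and counts are made-up; on HDFS the count would come from `hadoop dfs -cat /path/to/param/list | wc -l` as shown above):

```shell
# Toy parameter file standing in for the real one on HDFS.
printf 'a=1\na=2\na=3\na=4\na=5\n' > params.txt

lines=$(wc -l < params.txt)   # total parameter sets
maps=2                        # desired number of mappers
# Ceiling division, so the last mapper picks up the remainder.
linespermap=$(( (lines + maps - 1) / maps ))
echo "$linespermap"           # prints 3
```

The result would then be passed to the job as mapred.line.input.format.linespermap, or simply left to default to 1 for one line per map.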



[jira] Updated: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-3221:
--------------------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597049#action_12597049 ] 

Hadoop QA commented on HADOOP-3221:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12382090/patch-3221-2.txt
  against trunk revision 656491.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2477/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2477/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2477/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2477/console

This message is automatically generated.



[jira] Updated: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-3221:
------------------------------------

    Release Note: Added org.apache.hadoop.mapred.lib.NLineInputFormat, which splits N lines of input as one split. N can be specified by the configuration property "mapred.line.input.format.linespermap", which defaults to 1.  (was: Adds org.apache.hadoop.mapred.lib.NLineInputFormat, which splits N lines of input as one split. N can be specified by the configuration property "mapred.line.input.format.linespermap", which defaults to 1.)

> Need a "LineBasedTextInputFormat"
> ---------------------------------
>
>                 Key: HADOOP-3221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3221
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.16.2
>         Environment: All
>            Reporter: Milind Bhandarkar
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.18.0
>
>         Attachments: patch-3221-1.txt, patch-3221-2.txt, patch-3221.txt
>
>
> In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters.
> (Referred to as "parameter sweeps".)
> One way to achieve this is to specify a set of parameters (one set per line) as input in a control file (which is the input path to the map-reduce application, whereas the input dataset is specified via a config variable in the JobConf).
> It would be great to have an InputFormat that splits the input file such that, by default, one line is fed as the value to one map task, and the key could be the line number, i.e. (k,v) is (LongWritable, Text).
> If the user specifies the number of maps explicitly, each mapper should get a contiguous chunk of lines (so as to load-balance between the mappers).
> The location hints for the splits should not be derived from the input file, but rather should span the whole mapred cluster.
> (Is there a way to do this without having to return an array of nSplits*nTaskTrackers?)
> Increasing the replication of the "real" input dataset (since it will be fetched by all the nodes) is orthogonal, and one can use DistributedCache for that.
> (P.S. Please choose a better name for this InputFormat. I am not in love with the "LineBasedText" name.)



[jira] Updated: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-3221:
--------------------------------------------

    Attachment: patch-3221.txt

Here is a patch doing the implementation in the following way:
1. Adds the classes org.apache.hadoop.mapred.LineSplit, org.apache.hadoop.mapred.NLineInputFormat, and
org.apache.hadoop.mapred.NLineInputFormat.NLineRecordReader, plus a test, org.apache.hadoop.mapred.TestLineInputFormat.

2. LineSplit implements InputSplit. A LineSplit carries the number of lines and the lines themselves, so that all the mappers do not have to fetch the same file simultaneously.

3. NLineInputFormat extends FileInputFormat. It splits the input into N lines per split, where N is derived from the number of map tasks specified (through mapred.map.tasks in the JobConf). The value of N defaults to 1.

4. NLineRecordReader reads one line at a time from the LineSplit. The (key, value) pair is (LongWritable, Text), where the key is the line number and the value is the line.

Thoughts?
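
For readers following along, the record-reader behavior described in points 2 and 4 can be sketched in plain Java, without the Hadoop dependencies. The classes below are simplified stand-ins with illustrative names, not the code in the patch:

```java
import java.util.Arrays;
import java.util.List;

public class NLineSketch {
    // Stand-in for LineSplit: carries the lines themselves plus the
    // line number of the first line, so a mapper need not re-read the file.
    static class LineSplit {
        final long firstLineNo;
        final List<String> lines;
        LineSplit(long firstLineNo, List<String> lines) {
            this.firstLineNo = firstLineNo;
            this.lines = lines;
        }
    }

    // Stand-in for NLineRecordReader: yields (lineNumber, line) pairs,
    // matching the (LongWritable, Text) contract in spirit.
    static class LineRecordReader {
        private final LineSplit split;
        private int pos = 0;
        LineRecordReader(LineSplit split) { this.split = split; }
        boolean hasNext() { return pos < split.lines.size(); }
        long nextKey() { return split.firstLineNo + pos; }
        String nextValue() { return split.lines.get(pos++); }
    }

    public static void main(String[] args) {
        // Two parameter-sweep lines packed into one split starting at line 0.
        LineSplit split = new LineSplit(0, Arrays.asList("alpha=1", "beta=2"));
        LineRecordReader rr = new LineRecordReader(split);
        while (rr.hasNext()) {
            System.out.println(rr.nextKey() + "\t" + rr.nextValue());
        }
    }
}
```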



[jira] Issue Comment Edited: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597018#action_12597018 ] 

amareshwari edited comment on HADOOP-3221 at 5/14/08 11:58 PM:
---------------------------------------------------------------------------

Here is a patch adding org.apache.hadoop.mapred.lib.NLineInputFormat, which splits N lines of the input file into one split.
N is specified using the config variable "mapred.line.input.format.linespermap", which defaults to 1. For files whose number of lines is not evenly divisible by N, the last split constructed from that file will have fewer than N lines.

NLineInputFormat constructs FileSplits containing N lines each, and uses LineRecordReader to read the lines from each split.
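
A job would set the property named above in its configuration; for example (the property name is from this patch, the value of 5 is purely illustrative):

```
<property>
  <name>mapred.line.input.format.linespermap</name>
  <value>5</value>
</property>
```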



[jira] Updated: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-3221:
--------------------------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3221:
----------------------------------

    Hadoop Flags: [Reviewed]

+1 Patch looks good.



[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595523#action_12595523 ] 

Hadoop QA commented on HADOOP-3221:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12381742/patch-3221.txt
  against trunk revision 654315.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 3 new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2437/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2437/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2437/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2437/console

This message is automatically generated.



[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Posted by "Milind Bhandarkar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596714#action_12596714 ] 

Milind Bhandarkar commented on HADOOP-3221:
-------------------------------------------

The FileSplit approach will work (although the replication factor for the parameter list has to be increased to 10, similar to job.jar), as Devaraj describes it. Each map should get exactly one line, no more, no less, so the file offsets in the splits have to be exact for that case (not file-length / 80 or something). Having exact offsets pointing to each \n will make LineRecordReader reusable in this case, right? The unit test needs to verify this. The current OneLineInputFormat that Lohit built uses this approach, and users have been happy with it.

In the case of N lines per mapper, the same approach should work, but it will require two passes over the input file: first to count the lines, and then to compute the splits. If the number of lines is not divisible by the number of mappers, it's OK to have the last mapper consume fewer lines (although dividing the slack among more than one mapper would be better).
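
The two passes described above can be sketched in plain Java (class and method names are illustrative; the real implementation would emit Hadoop FileSplits rather than {offset, length} pairs):

```java
import java.util.ArrayList;
import java.util.List;

public class LineOffsets {
    // First pass: record the exact byte offset at which each line starts,
    // i.e. offset 0 plus the position after every '\n' (a trailing '\n'
    // does not start a new line).
    static List<Long> lineStartOffsets(byte[] data) {
        List<Long> starts = new ArrayList<>();
        if (data.length > 0) starts.add(0L);
        for (int i = 0; i < data.length - 1; i++) {
            if (data[i] == '\n') starts.add((long) (i + 1));
        }
        return starts;
    }

    // Second pass: group the line starts into splits of n lines each.
    // Each result is {startOffset, length}; the last split may cover
    // fewer than n lines, as discussed in the comment above.
    static List<long[]> nLineSplits(byte[] data, int n) {
        List<Long> starts = lineStartOffsets(data);
        List<long[]> splits = new ArrayList<>();
        for (int i = 0; i < starts.size(); i += n) {
            long begin = starts.get(i);
            long end = (i + n < starts.size()) ? starts.get(i + n) : data.length;
            splits.add(new long[]{begin, end - begin});
        }
        return splits;
    }

    public static void main(String[] args) {
        // Five one-character lines, two lines per split: the last split
        // holds only one line.
        byte[] data = "a\nb\nc\nd\ne\n".getBytes();
        for (long[] s : nLineSplits(data, 2)) {
            System.out.println(s[0] + "," + s[1]);
        }
    }
}
```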
