Posted to mapreduce-dev@hadoop.apache.org by Aaron Kimball <aa...@cloudera.com> on 2010/02/10 23:16:17 UTC

Of Configurations and Contexts

Hi folks,

I've uncovered some behavior in Hadoop that I found surprising. I think this
represents a design flaw that I'd like to see corrected.

As we well know, decoupled components in a MapReduce job communicate
information forward through the use of Configuration instances. Every
Context (JobContext, TaskAttemptContext, MapContext, etc) carries a
Configuration object inside, accessible via getConfiguration().

The semantics of passing data from the "configuration phase" to the "run
phase" are straightforward: the user creates a Job on the client machine, populates its
Configuration with necessary values, and all those values will be visible in
the JobContext received in the map/reduce tasks themselves. Every task
expects to get the same view of the user-configured values here.
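
For example (a minimal sketch; the property name is made up):

    // Client side, before job submission:
    Configuration conf = new Configuration();
    conf.set("my.app.threshold", "42");
    Job job = new Job(conf);
    // ... set formats and paths, then submit.

    // Task side, e.g. inside Mapper.setup(Context context):
    String t = context.getConfiguration().get("my.app.threshold");  // "42"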

Similarly, in my Mapper, if during the setup() method I call
context.getConfiguration().set("foo","bar"), I expect that
context.getConfiguration().get("foo") returns "bar" during the cleanup()
method. During a map task's execution, the configuration moves "forward
linearly" through time.

The confusing part is that during the initial setup steps of the map task, a
series of different configurations is used. The noteworthy code is in
MapTask.java, in the runNewMapper() method (lines 607--650). A JobContext
is passed in; this is immediately used as the basis for a
TaskAttemptContext. The TAC is then used to initialize the InputFormat and
the RecordReader. The JobContext is then re-used to instantiate a
MapContext. The RecordReader's "initialize" method is then called with this
context, ostensibly to "switch the RR over" to the MapContext. The Mapper
itself is then run with the MapContext. Each of these two new Context
objects makes a deep copy of the Configuration present in JobContext.
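
Condensed into pseudo-Java (a paraphrase, not the literal source; generics,
mapper instantiation, and error handling elided):

    // MapTask.runNewMapper(), roughly:
    TaskAttemptContext taskContext =
        new TaskAttemptContext(job, getTaskID());  // deep copy of job's conf

    InputFormat inputFormat = ReflectionUtils.newInstance(
        taskContext.getInputFormatClass(), job);
    RecordReader input = inputFormat.createRecordReader(split, taskContext);

    MapContext mapContext =
        new MapContextImpl(job, getTaskID(), input, output,
            committer, reporter, split);  // second deep copy, again taken
                                          // from the JobContext's conf

    input.initialize(split, mapContext);  // "switch the RR over"
    mapper.run(mapContext);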

The problem here is that if the InputFormat sets any Configuration settings,
the RecordReader will properly receive those during its construction -- but
the same RecordReader may be using a *different* context and thus a
*different* configuration during the actual running of the Mapper itself!
LineRecordReader in particular downcasts its TaskAttemptContext to a
MapContext at some point during its lifetime, assuming that this
initialize() call has been made and that the new context is a MapContext.
This is completely type-unsafe, and means LineRecordReader cannot always
be safely wrapped inside another RecordReader.

Furthermore, other RecordReaders' initialize() methods do nothing at all;
those readers simply continue to use the Context they were created with.

So now Configuration settings set in InputFormat.createRecordReader() may or
may not be present in the Configuration accessible during
RecordReader.nextKeyValue() depending on RecordReader.initialize()'s
semantics (and that of any outer RecordReader wrapping this one!).
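
To make the hazard concrete, here is a sketch (fragments only; the class
and key names are hypothetical, and the remaining RecordReader boilerplate
is elided):

    // In a hypothetical InputFormat:
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
        InputSplit split, TaskAttemptContext context) {
      // Writes into the TaskAttemptContext built in runNewMapper().
      context.getConfiguration().set("marker", "set-by-inputformat");
      return new MarkingRecordReader(context);
    }

    // In the matching RecordReader:
    private TaskAttemptContext context;

    public MarkingRecordReader(TaskAttemptContext constructionContext) {
      this.context = constructionContext;  // "marker" is visible here
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext newContext) {
      // newContext is the MapContext; its Configuration was deep-copied
      // from the JobContext *before* createRecordReader() ran, so it does
      // not contain "marker".
      this.context = newContext;
    }

    @Override
    public boolean nextKeyValue() {
      // "set-by-inputformat" if initialize() was skipped or ignored;
      // null if we switched over to the MapContext above.
      String marker = context.getConfiguration().get("marker");
      ...
    }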

This led to a pretty subtle bug in some code I was writing yesterday using
CombineFileInputFormat, which requires that you wrap some RecordReader
instances in others.

So my questions are:
* Is there a solid rationale for isolating the Configuration used at these
various points in time?
* If not, is there a reason to make those deep copies of the Configuration,
or can they all just share a reference to the same Configuration instance?
* If we really want deep copies, can the MapContext's copy be based off the
TaskAttemptContext's copy, so that we at least have a linear flow of
configuration settings through the execution of MapTask.runNewMapper()?

I'm happy to write a patch to make these semantics clearer. As it is, I
think the notion of needing to reinitialize the RecordReader with a
completely different context is error-prone. (CombineFileRecordReader, for
example, in its initialize() method, does not call curReader.initialize() to
initialize its child. This is a separate bug, which I'll post a patch for,
but the design of the context situation makes this more problematic than it
otherwise needs to be.)
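
For what it's worth, the delegation fix is roughly the following (a sketch
against CombineFileRecordReader's existing split, context, and curReader
fields; error handling elided):

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException {
      this.split = (CombineFileSplit) split;
      this.context = context;
      if (curReader != null) {
        // Forward the new context; otherwise the child silently keeps
        // whatever context it was constructed with.
        curReader.initialize(split, context);
      }
    }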

Does anyone have any input on this situation?
Thanks,
- Aaron Kimball

Re: Of Configurations and Contexts

Posted by Aaron Kimball <aa...@cloudera.com>.
Not really. I think it's just a program flow wiring issue. See the patch at
MAPREDUCE-1486.

On Fri, Feb 12, 2010 at 8:43 AM, Chris K Wensel <ch...@wensel.net> wrote:

> Is this a side effect driven by having all the static accessors speak to
> specific Context types instead of Configuration?
>
> https://issues.apache.org/jira/browse/MAPREDUCE-851
>
> ckw

Re: Of Configurations and Contexts

Posted by Chris K Wensel <ch...@wensel.net>.
Is this a side effect driven by having all the static accessors speak to specific Context types instead of Configuration?

https://issues.apache.org/jira/browse/MAPREDUCE-851
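
For example (fragments; the variable names are made up), the new-API helper
is typed against a Context while the old one took the conf directly:

    // org.apache.hadoop.mapreduce.lib.output.FileOutputFormat:
    Path out = FileOutputFormat.getOutputPath(jobContext);  // JobContext

    // org.apache.hadoop.mapred.FileOutputFormat (old API):
    Path out2 = FileOutputFormat.getOutputPath(jobConf);    // JobConf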

ckw

On Feb 11, 2010, at 6:13 PM, Aaron Kimball wrote:

> Filed https://issues.apache.org/jira/browse/MAPREDUCE-1486 with a concrete
> test case.

--
Chris K Wensel
chris@concurrentinc.com
http://www.concurrentinc.com


Re: Of Configurations and Contexts

Posted by Aaron Kimball <aa...@cloudera.com>.
Filed https://issues.apache.org/jira/browse/MAPREDUCE-1486 with a concrete
test case.


Re: Of Configurations and Contexts

Posted by Aaron Kimball <aa...@cloudera.com>.
This patch would allow the configuration to flow through the map task in a
linear fashion:

diff --git src/java/org/apache/hadoop/mapred/MapTask.java
src/java/org/apache/hadoop/mapred/MapTask.java
index f889a47..4170922 100644
--- src/java/org/apache/hadoop/mapred/MapTask.java
+++ src/java/org/apache/hadoop/mapred/MapTask.java
@@ -637,7 +637,9 @@ class MapTask extends Task {

     org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY,
OUTVALUE>
     mapContext =
-      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job,
getTaskID(),
+      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(
+          taskContext.getConfiguration(),
+          getTaskID(),
           input, output,
           committer,
           reporter, split);
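
With this change the MapContext's Configuration is seeded from the
TaskAttemptContext's copy rather than directly from the JobContext, so
anything the InputFormat set during createRecordReader() stays visible
while the Mapper runs.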



Re: Of Configurations and Contexts

Posted by Todd Lipcon <to...@cloudera.com>.
On Wed, Feb 10, 2010 at 2:25 PM, Todd Lipcon <to...@cloudera.com> wrote:
> This is, if I'm understanding Aaron correctly, the same issue that
> makes the mapred.input.file configuration very hard to implement in
> the new API.
>

Sorry, I should clarify this. Obviously that particular feature can be
accessed by downcasting the input split to FileSplit. But this is very
hard to deal with when the input format wants to use a different
implementation class - you end up coupling the mapper to the
inputformat in a dirty way. Or, if you want to access the input file
name from the OutputFormat, I believe you're entirely out of luck
(though I haven't looked in trunk). In the prior API where
Configuration got passed through the flow nicely, it was trivial to do
this.
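
That is, something like the following, which hard-codes the split's
concrete class into the mapper:

    // Inside the new-API Mapper:
    FileSplit split = (FileSplit) context.getInputSplit();
    String inputFile = split.getPath().toString();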

-Todd


Re: Of Configurations and Contexts

Posted by Todd Lipcon <to...@cloudera.com>.
This is, if I'm understanding Aaron correctly, the same issue that
makes the mapred.input.file configuration very hard to implement in
the new API.

-Todd
