Posted to common-dev@hadoop.apache.org by "Runping Qi (JIRA)" <ji...@apache.org> on 2006/03/31 17:31:40 UTC

[jira] Created: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Key: HADOOP-115
URL: http://issues.apache.org/jira/browse/HADOOP-115
Project: Hadoop
Type: Improvement
Components: mapred
Reporter: Runping Qi

When map tasks write intermediate data out, they always use the SequenceFile RecordWriter with the key/value classes from the job object.

When the reducers write the final results out, their output format is obtained from the job object. By default it is TextOutputFormat, and there is no conflict.
However, if one wants to use SequenceFileOutputFormat for the final results, then the key/value classes are also obtained from the job object, the same as for the map tasks' output. Now we have a problem: it is impossible for the map outputs and the reducer outputs to use different key/value classes if one wants the reducers to generate output in SequenceFileOutputFormat.

A simple fix would be to add another two attributes to the JobConf class: mapOutputKeyClass and mapOutputValueClass. That would allow the user to have different key/value classes for the intermediate and final outputs.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira

Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by Doug Cutting <cu...@apache.org>.

Konstantin Shvachko wrote:
> Here is another example that I dealt with.
> I wanted to use different value types (long, float or string) for both
> map and reduce tasks, depending on the actual key values. So the
> solution was to encode the value type into the key value.
> I used keys of the form
> l:<name> - indicating the value type is expected to be long
> f:<name> - value type is expected to be float
> s:<name> - value is a string
> The example is under HADOOP-95.
> Thought somebody might find it practical.

On a related note, ObjectWritable can be used as input or output type, 
and can wrap any Writable class, thus permitting polymorphic inputs and 
outputs.  Nutch uses this to, e.g., combine a URL's incoming anchor 
texts and its content when indexing.  The input type is ObjectWritable, 
and the indexer's InputFormat wraps values from a variety of files.  The 
indexing reducer can then use the 'instanceof' operator to determine how 
to process each input value.  To be more object-oriented, one could have 
all of these classes implement some Indexable interface whose methods 
are invoked when reducing.

Doug
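
A minimal sketch of the 'instanceof' dispatch described above, written in plain Java rather than against the Hadoop API. The AnchorText and PageContent classes and the IndexingSketch.describe helper are hypothetical stand-ins for the values an ObjectWritable would wrap; they are not Nutch or Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value types standing in for what ObjectWritable would wrap.
final class AnchorText {
    final String text;
    AnchorText(String text) { this.text = text; }
}

final class PageContent {
    final String body;
    PageContent(String body) { this.body = body; }
}

final class IndexingSketch {
    // The reducer receives values of mixed types for one key (a URL)
    // and uses instanceof to decide how to handle each of them.
    static String describe(List<Object> values) {
        List<String> anchors = new ArrayList<>();
        String body = "";
        for (Object value : values) {
            if (value instanceof AnchorText) {
                anchors.add(((AnchorText) value).text);
            } else if (value instanceof PageContent) {
                body = ((PageContent) value).body;
            } else {
                throw new IllegalArgumentException("unexpected value: " + value);
            }
        }
        return "anchors=" + anchors + " body=" + body;
    }
}
```

The more object-oriented variant mentioned above would replace the instanceof chain with a method on a shared interface that each value class implements.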

Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.

I agree that the framework must be as general as possible, which means
one should use some simple data structure for keys and values, like
string or BytesWritable. Also, nothing prevents us from implementing
other types on top of the framework as an optional layer of
higher-level API.

Here is another example that I dealt with.
I wanted to use different value types (long, float or string) for both
map and reduce tasks, depending on the actual key values. So the
solution was to encode the value type into the key value.
I used keys of the form
l:<name> - indicating the value type is expected to be long
f:<name> - value type is expected to be float
s:<name> - value is a string
The example is under HADOOP-95.
Thought somebody might find it practical.

--Konstantin
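
A minimal sketch of the key-prefix scheme described above, in plain Java. This is not the code attached to HADOOP-95; the TypedKeySketch class and the assumption that values travel as strings are illustrative only.

```java
// Hypothetical helper: decodes a string value according to the type tag
// carried in the key prefix ("l:", "f:" or "s:").
final class TypedKeySketch {
    static Object decode(String key, String rawValue) {
        if (key.startsWith("l:")) {
            return Long.valueOf(rawValue);   // value is expected to be a long
        } else if (key.startsWith("f:")) {
            return Float.valueOf(rawValue);  // value is expected to be a float
        } else if (key.startsWith("s:")) {
            return rawValue;                 // value is a plain string
        }
        throw new IllegalArgumentException("key has no type prefix: " + key);
    }

    // Strips the two-character type tag to recover <name>.
    static String name(String key) {
        return key.substring(2);
    }
}
```

With this, a reducer that receives the key "l:count" knows to parse its values as longs without any change to the job's declared value class.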

Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by Doug Cutting <cu...@apache.org>.

Eric Baldeschwieler wrote:
> An observation...  this whole thread is about limits caused by type  
> safety.  Interestingly, the other implementation of map-reduce does  not 
> support types at all.  Everything is a string.
> 
> So I agree that our departure from the paper is the problem.  ;-)

A corollary is that one could simply use BytesWritable for all one's 
keys and values, altering only one's WritableComparator implementation, 
and one would not encounter this problem.  The use of types in Hadoop is 
thus an optional feature.  One could even layer a different type system 
on top of BytesWritable that exhibits the desired properties.

> I'm comfortable letting this lie for a while.  But I predict we've  not 
> heard the last of it.

Owen seems to be picking it up, which is fine by me.

Doug
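
A minimal sketch of the idea above: keys are carried as untyped bytes, and only the comparator knows how to interpret them. It uses plain byte arrays and java.util.Comparator as stand-ins for BytesWritable and WritableComparator; the RawKeySketch class and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Comparator;

final class RawKeySketch {
    // Encodes an int as 4 big-endian bytes, the untyped form the framework would see.
    static byte[] encode(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // The only type-aware piece: a comparator that decodes the bytes as ints.
    static final Comparator<byte[]> AS_INT =
        (a, b) -> Integer.compare(ByteBuffer.wrap(a).getInt(), ByteBuffer.wrap(b).getInt());

    // Encodes the values, sorts the raw keys with AS_INT, and decodes them again.
    static int[] sortedInts(int... values) {
        byte[][] keys = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            keys[i] = encode(values[i]);
        }
        Arrays.sort(keys, AS_INT);
        int[] out = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            out[i] = ByteBuffer.wrap(keys[i]).getInt();
        }
        return out;
    }
}
```

A plain unsigned byte-by-byte comparison would sort -1 (all bits set) after every positive key, which is why swapping the comparator is enough to change how the same bytes are ordered.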

Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.

An observation...  this whole thread is about limits caused by type  
safety.  Interestingly, the other implementation of map-reduce does  
not support types at all.  Everything is a string.

So I agree that our departure from the paper is the problem.  ;-)

I'm comfortable letting this lie for a while.  But I predict we've  
not heard the last of it.

On Apr 2, 2006, at 10:29 PM, Doug Cutting wrote:

> Runping Qi wrote:
>> The argument of using local combiners is interesting. To me, the
>> combiner class is just another layer of transformer.  It does not mean
>> that the combiner class has to be the same as the reducer class. The
>> only criterion is that they satisfy the associative rule:
>>     Let L1, L2, ..., Ln and K1, K2, ..., Km be two partitions of S, then
>>     Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln))) and
>>     Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km))) are the
>> same.
>> A special (maybe very common) scenario is that the combiner and the
>> reducer are the same class and the reduce function is associative.
>> However, this need not be the case in general. And the class of the
>> reduce outputs need not be the same as that of the combiner, if the
>> combiner and the reducer are not the same class.
>
> This may indeed be an intriguing generalization of the MapReduce
> model.  But it does add more possible failure modes.  At present we
> have far too few unit tests for the existing, simpler MapReduce
> model, and the platform is still shaky.  Thus I am reluctant to
> spend a lot of time extending the model in ways that are not
> absolutely essential.
>
> My goal is for Hadoop to be widely used.  I do not feel that the  
> power of the MapReduce model is currently a primary bottleneck to  
> wider adoption.  The larger issues we face are performance,  
> reliability, scalability and documentation.
>
> If I am to commit a patch, then I must feel that I can support and  
> maintain it, that it fits within my priorities.  Otherwise, if it  
> causes problems that I don't have time to attend to (even if this  
> only means reviewing and testing fixes submitted by others) then  
> the quality of the system will decrease, a vector we must avoid.
>
> Currently we have just four committers on Hadoop.  For Mike and  
> Andrzej, Nutch is a secondary effort.  Owen has been voted in as a  
> Hadoop committer, but his paperwork is not yet complete.  So I am  
> the bottleneck.  I spend a lot of time on annoying yet critical  
> issues like making sure that recent extensions to Hadoop don't  
> break Nutch running in pseudo-distributed mode on Windows.
>
> I don't particularly like things this way, but that's where we are  
> right now.  The best way to get out of here is for folks who'd like  
> to be committers to submit high-quality, well documented, well- 
> formatted, non-disruptive, unit-test-bearing patches that are easy  
> for me to apply and make Hadoop easier to use and more reliable,  
> thus earning points towards becoming committers.  If we have more  
> committers then we should be able to advance with confidence on  
> more fronts in parallel.
>
> Doug

Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by Doug Cutting <cu...@apache.org>.

Runping Qi wrote:
> The argument of using local combiners is interesting. To me, the combiner
> class is just another layer of transformer.  It does not mean that the
> combiner class has to be the same as the reducer class. The only criterion
> is that they satisfy the associative rule:
>     Let L1, L2, ..., Ln and K1, K2, ..., Km be two partitions of S, then
>     Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln))) and
>     Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km))) are the
> same.
>
> A special (maybe very common) scenario is that the combiner and the reducer
> are the same class and the reduce function is associative. However, this
> need not be the case in general. And the class of the reduce outputs need
> not be the same as that of the combiner, if the combiner and the reducer
> are not the same class.

This may indeed be an intriguing generalization of the MapReduce
model.  But it does add more possible failure modes.  At present we have
far too few unit tests for the existing, simpler MapReduce model, and
the platform is still shaky.  Thus I am reluctant to spend a lot of
time extending the model in ways that are not absolutely essential.

My goal is for Hadoop to be widely used.  I do not feel that the power 
of the MapReduce model is currently a primary bottleneck to wider 
adoption.  The larger issues we face are performance, reliability, 
scalability and documentation.

If I am to commit a patch, then I must feel that I can support and 
maintain it, that it fits within my priorities.  Otherwise, if it causes 
problems that I don't have time to attend to (even if this only means 
reviewing and testing fixes submitted by others) then the quality of the 
system will decrease, a vector we must avoid.

Currently we have just four committers on Hadoop.  For Mike and Andrzej, 
Nutch is a secondary effort.  Owen has been voted in as a Hadoop 
committer, but his paperwork is not yet complete.  So I am the 
bottleneck.  I spend a lot of time on annoying yet critical issues like 
making sure that recent extensions to Hadoop don't break Nutch running 
in pseudo-distributed mode on Windows.

I don't particularly like things this way, but that's where we are right 
now.  The best way to get out of here is for folks who'd like to be 
committers to submit high-quality, well documented, well-formatted, 
non-disruptive, unit-test-bearing patches that are easy for me to apply 
and make Hadoop easier to use and more reliable, thus earning points 
towards becoming committers.  If we have more committers then we should 
be able to advance with confidence on more fronts in parallel.

Doug

RE: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by Runping Qi <ru...@yahoo-inc.com>.

The argument of using local combiners is interesting. To me, the combiner
class is just another layer of transformer.  It does not mean that the
combiner class has to be the same as the reducer class. The only criterion
is that they satisfy the associative rule:
    Let L1, L2, ..., Ln and K1, K2, ..., Km be two partitions of S, then
    Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln))) and
    Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km))) are the
same.

A special (maybe very common) scenario is that the combiner and the reducer
are the same class and the reduce function is associative. However, this
need not be the case in general. And the class of the reduce outputs need
not be the same as that of the combiner, if the combiner and the reducer
are not the same class.

Runping
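
A minimal sketch of the partition rule stated above, using a mean as the example. The combiner emits a (sum, count) pair and the reducer emits a double, so the two are different classes with different output types, yet the result does not depend on how the values are partitioned. The CombinerSketch class and its method names are hypothetical and do not use the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

final class CombinerSketch {
    // Combiner: collapses one partition of the values into a partial {sum, count}.
    static long[] combine(List<Integer> partition) {
        long sum = 0;
        for (int v : partition) {
            sum += v;
        }
        return new long[] {sum, partition.size()};
    }

    // Reducer: merges the partials into a mean. Its output type (double)
    // differs from the combiner's output type (a sum/count pair).
    static double reduce(List<long[]> partials) {
        long sum = 0;
        long count = 0;
        for (long[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return (double) sum / count;
    }

    // Applies the combiner to each partition, then the reducer to the results.
    static double run(List<List<Integer>> partitions) {
        List<long[]> partials = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            partials.add(combine(partition));
        }
        return reduce(partials);
    }
}
```

Splitting the values 1..5 as {1,2},{3,4,5} or as {1},{2,3},{4,5} yields the same mean, which is the property the rule asks for.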

-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org] 
Sent: Sunday, April 02, 2006 1:30 PM
To: hadoop-dev@lucene.apache.org
Subject: Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to
use SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.

Eric Baldeschwieler wrote:
> I can not think of a case where this proposed extension complicates  
> code or reduces compressibility.  Since it is backwards compatible  with 
> your desired API, purists can simply ignore the option.

It makes the insertion of a combiner no longer transparent.  The reducer 
would have to know whether a combiner had been used in order to know how 
to process the map output.

In general this seems like a micro-optimization.  It saves little code.
Instead of writing 'collector.collect(key, new List(value))' one could
write 'collector.collect(key, value)'.

Taking this to its logical extreme, in the classic word-count use of 
MapReduce, why should one have to emit ones for the map values?  Why 
have a value at all?  Why not add a collect(key) method, then permit 
reducers to be passed an iterator which returns null for all values 
where collect(key) was called.  That would save a little code and make 
the intermediate data a bit smaller.  So should we do it?  I'd argue not.

Doug

Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by Doug Cutting <cu...@apache.org>.

Eric Baldeschwieler wrote:
> I can not think of a case where this proposed extension complicates  
> code or reduces compressibility.  Since it is backwards compatible  with 
> your desired API, purists can simply ignore the option.

It makes the insertion of a combiner no longer transparent.  The reducer 
would have to know whether a combiner had been used in order to know how 
to process the map output.

In general this seems like a micro-optimization.  It saves little code.
Instead of writing 'collector.collect(key, new List(value))' one could
write 'collector.collect(key, value)'.

Taking this to its logical extreme, in the classic word-count use of 
MapReduce, why should one have to emit ones for the map values?  Why 
have a value at all?  Why not add a collect(key) method, then permit 
reducers to be passed an iterator which returns null for all values 
where collect(key) was called.  That would save a little code and make 
the intermediate data a bit smaller.  So should we do it?  I'd argue not.

Doug

Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.

Sure there is something wrong with requiring extra map-reduce
passes.  Without significant development it can be very expensive
(shuffling, sorting and rewriting your whole output set can be a
significant burden).  Pointlessly so, since the extension is clear,
safe and easier to explain than the restriction.

I think we can all agree that a project goal is to keep the design as
simple and focused as possible.  I'd find an argument against an
extension based on those goals pretty compelling, but the lack of a
feature in a paper from Google doesn't seem like a compelling reason
to reject something.  The Hadoop approach to many decisions varies
from Google's; this is not a bad thing.

I can not think of a case where this proposed extension complicates  
code or reduces compressibility.  Since it is backwards compatible  
with your desired API, purists can simply ignore the option.

On Apr 1, 2006, at 9:29 AM, Andrew McNabb wrote:

> On Sat, Apr 01, 2006 at 06:19:27PM +0100, Teppo Kurki (JIRA) wrote:
>>
>> My original post about the issue gives a simple case that would  
>> benefit from this:
>> http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00073.html
>>
>
> This should be done in two map-reduce phases.  There's nothing wrong
> with running two phases (or 10,000).
>
> -- 
> Andrew McNabb
> http://www.mcnabbs.org/andrew/
> PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868

Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by Andrew McNabb <am...@mcnabbs.org>.

On Sat, Apr 01, 2006 at 06:19:27PM +0100, Teppo Kurki (JIRA) wrote:
> 
> My original post about the issue gives a simple case that would benefit from this: http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00073.html
> 

This should be done in two map-reduce phases.  There's nothing wrong
with running two phases (or 10,000).

-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868

Re: [jira] Reopened: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by Andrew McNabb <am...@mcnabbs.org>.

On Sat, Apr 01, 2006 at 09:21:26AM +0100, Owen O'Malley (JIRA) wrote:
> 
> Let's reopen this. I've had discussions with Runping today, and it seems to me that:

> 3. It is less clear that we should allow the user to change the key
> type in the reduce, but since the current API does allow them to
> change the value (if not the type), I think we should be consistent
> and allow a type change too.

I strongly disagree.  I think it's unnecessary, and I think it breaks
the model too much.  Google has many thousands of map reduce
applications, and they haven't broken the model yet.  I don't really
care about output formats, but I think we're just asking for trouble if
we allow the type of the reduce output to be different from the type of
the map output.

-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868

Re: [jira] Resolved: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by Andrew McNabb <am...@mcnabbs.org>.

On Fri, Mar 31, 2006 at 06:17:48PM +0100, Doug Cutting (JIRA) wrote:
> 
> This is the way it is supposed to work.  From the MapReduce paper:
> 
>   map (k1,v1) -> list(k2,v2)
>   reduce (k2,list(v2)) -> list(v2)
> 
>   I.e., the input keys and values are drawn from a different
>   domain than the output keys and values. Furthermore,
>   the intermediate keys and values are from the same domain
>   as the output keys and values.
> 
> I am closing this bug.  If someone feels strongly that we should extend the MapReduce model in this direction, then we can re-open it.  But, as it stands, things work as intended.
> 

I agree with you.  If you need to change keys, run a second map-reduce
phase.

-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868

[jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "eric baldeschwieler (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12372783 ] 

eric baldeschwieler commented on HADOOP-115:
--------------------------------------------

Ah!  Are you suggesting that getOutput* always describes the final classes output from reduce and, if you don't set the new MapOutput* variables, also controls the map?  That is clear enough and backwards compatible.  I just was not looking at it the right way!
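
A minimal sketch of the fallback behaviour described in this comment, shown for the key class only (the value class would work the same way). JobConfSketch is a hypothetical stand-in, not the Hadoop JobConf class, and the property names "output.key" and "map.output.key" are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the relevant part of JobConf; not the Hadoop class.
final class JobConfSketch {
    private final Map<String, Class<?>> classes = new HashMap<>();

    void setOutputKeyClass(Class<?> c) { classes.put("output.key", c); }
    void setMapOutputKeyClass(Class<?> c) { classes.put("map.output.key", c); }

    // Always describes the final (reduce) output.
    Class<?> getOutputKeyClass() {
        return classes.get("output.key");
    }

    // Falls back to the final output class when no map-specific class is set,
    // so jobs that never call setMapOutputKeyClass behave as before.
    Class<?> getMapOutputKeyClass() {
        Class<?> c = classes.get("map.output.key");
        return c != null ? c : getOutputKeyClass();
    }
}
```

A job that only sets the output key class gets that class for both phases; setting the map output key class afterwards changes the intermediate type without touching the final one.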


[jira] Resolved: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]
     
Doug Cutting resolved HADOOP-115:
---------------------------------

    Resolution: Won't Fix

This is the way it is supposed to work.  From the MapReduce paper:

  map (k1,v1) -> list(k2,v2)
  reduce (k2,list(v2)) -> list(v2)

  I.e., the input keys and values are drawn from a different
  domain than the output keys and values. Furthermore,
  the intermediate keys and values are from the same domain
  as the output keys and values.

I am closing this bug.  If someone feels strongly that we should extend the MapReduce model in this direction, then we can re-open it.  But, as it stands, things work as intended.
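
A minimal sketch contrasting the reduce signature quoted from the paper with the extension this issue asks for, expressed as Java generics. The interface and class names (PaperReducer, ExtendedReducer, ReducerTypesSketch) are hypothetical and are not part of the Hadoop API.

```java
import java.util.AbstractMap;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// The model as quoted from the paper: reduce emits values of the
// same type V2 that it consumes.
interface PaperReducer<K2, V2> {
    List<V2> reduce(K2 key, List<V2> values);
}

// The extension requested in this issue: reduce may emit a key/value
// type (K3, V3) that differs from the map output it consumes (K2, V2).
interface ExtendedReducer<K2, V2, K3, V3> {
    List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
}

final class ReducerTypesSketch {
    // Fits the paper's signature: Integer counts in, Integer total out.
    static final PaperReducer<String, Integer> SUM = (key, values) -> {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return Collections.singletonList(total);
    };

    // Needs the extension: Integer counts in, (String, Double) pair out.
    static final ExtendedReducer<String, Integer, String, Double> MEAN = (key, values) -> {
        double total = 0;
        for (int v : values) {
            total += v;
        }
        Map.Entry<String, Double> entry =
            new AbstractMap.SimpleEntry<>(key, total / values.size());
        return Collections.singletonList(entry);
    };
}
```

Under the paper's signature the mean would have to be forced into the Integer value type or computed in a second pass; the extended signature lets the reducer emit it directly.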

[jira] Updated: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "Teppo Kurki (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]

Teppo Kurki updated HADOOP-115:
-------------------------------

    Attachment: hadoop-115_tk.patch

(1) adds JobConf.set/getMapKey/ValueClass
(2) changes in ReduceTask so that the Map phase classes are used in append & sort
(3) explicit key and value class specification in OutputFormat.getRecordWriter so that they are not automatically pulled out of JobConf

Tests run ok, but I haven't added tests for different Map/Reduce key/value classes.

Owen's suggestion on automatically typing the OutputFormat when the first record is written would cause changes in fewer classes, but I had this already working...



[jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "Teppo Kurki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12373780 ] 

Teppo Kurki commented on HADOOP-115:
------------------------------------

You're right about the getMapOutputComparatorClass and the needless interface change.

Automatic/deferred SequenceFile typing based on the first record doesn't seem feasible because SequenceFile.Writer.append(byte[]...) is used here and there.

I already had a second go at this with fewer changes, but I'll keep it to myself until I can put some unit tests together.

At least I'm learning about how things work under the hood.
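
A minimal sketch of one reading of the problem described above: a writer that fixes its key/value classes on the first typed record has nothing to inspect when the first record arrives as raw bytes. DeferredTypeWriterSketch and its methods are hypothetical; this is not how SequenceFile.Writer is implemented.

```java
// Hypothetical writer that fixes its key/value classes on the first typed append.
final class DeferredTypeWriterSketch {
    private Class<?> keyClass;
    private Class<?> valueClass;

    // Typed append: the first record decides the file's classes,
    // and later records must match them.
    void append(Object key, Object value) {
        if (keyClass == null) {
            keyClass = key.getClass();
            valueClass = value.getClass();
        } else if (keyClass != key.getClass() || valueClass != value.getClass()) {
            throw new IllegalArgumentException("record does not match file types");
        }
    }

    // Raw append: only serialized bytes are available, so there is no
    // object whose class could be inspected to determine the types.
    void appendRaw(byte[] record) {
        if (keyClass == null) {
            throw new IllegalStateException("types unknown: no typed record written yet");
        }
    }

    Class<?> getKeyClass() { return keyClass; }
}
```

In this sketch a raw append before any typed append fails, which is the situation that makes first-record typing awkward when callers write pre-serialized bytes.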





[jira] Resolved: (HADOOP-115) permit reduce input types to differ from reduce output types

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]
     
Doug Cutting resolved HADOOP-115:
---------------------------------

    Fix Version: 0.3
     Resolution: Fixed

Your patch has some strange formatting, and some spurious whitespace changes.  I fixed most of these.  It would also be best to have a unit test that uses this feature.  But I'm tired of this issue, and it should do no harm, so I committed it.

> permit reduce input types to differ from reduce output types
> ------------------------------------------------------------
>
>          Key: HADOOP-115
>          URL: http://issues.apache.org/jira/browse/HADOOP-115
>      Project: Hadoop
>         Type: New Feature

>   Components: mapred
>     Reporter: Runping Qi
>     Assignee: Runping Qi
>      Fix For: 0.3
>  Attachments: hadoop-115_ReduceTask.patch, hadoop-115_tk.patch, patch_115.txt.2006_05_16
>

[jira] Closed: (HADOOP-115) permit reduce input types to differ from reduce output types

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]
     
Doug Cutting closed HADOOP-115:
-------------------------------


> permit reduce input types to differ from reduce output types
> ------------------------------------------------------------
>
>          Key: HADOOP-115
>          URL: http://issues.apache.org/jira/browse/HADOOP-115
>      Project: Hadoop
>         Type: New Feature

>   Components: mapred
>     Reporter: Runping Qi
>     Assignee: Runping Qi
>      Fix For: 0.3.0
>  Attachments: hadoop-115_ReduceTask.patch, hadoop-115_tk.patch, patch_115.txt.2006_05_16

[jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12373714 ] 

Owen O'Malley commented on HADOOP-115:
--------------------------------------

Your patch adds {set,get}MapOutputComparatorClass(), which aren't needed, because the map outputs are the only ones that are compared. 

I don't think you are handling the combiners.

Did you need to change the interfaces to explicitly pass the key/value types to getRecordWriter()? Shouldn't getRecordWriter() only be called for the reduce outputs?

> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose  key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-115
>          URL: http://issues.apache.org/jira/browse/HADOOP-115
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Reporter: Runping Qi
>  Attachments: hadoop-115_tk.patch

[jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "eric baldeschwieler (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12372781 ] 

eric baldeschwieler commented on HADOOP-115:
--------------------------------------------

+1

But MapOutput seems confusing.  Shouldn't these be called FinalOutput or ReduceOutput?


> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose  key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-115
>          URL: http://issues.apache.org/jira/browse/HADOOP-115
>      Project: Hadoop
>         Type: Improvement
>   Components: mapred
>     Reporter: Runping Qi


[jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "Bryan Pendleton (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12372803 ] 

Bryan Pendleton commented on HADOOP-115:
----------------------------------------

I've implemented the above case with the existing code, and it's pretty simple... the output is actually always list(source), or perhaps a new type that encapsulates the list, and it just means that the map actually outputs a list of length 1, and the combiner/reducer concatenates lists.
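
The workaround described here can be sketched in plain Java, without any Hadoop classes. This is only an illustration under stated assumptions: the class name ListValueWorkaround and both method signatures are hypothetical, and String stands in for whatever key and value types a real job would use. The point it shows is that the value type is list(source) at every stage, so map output and reduce output share one class.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Hadoop.
class ListValueWorkaround {
    // "map": for every link found on the page `source`, emit
    // (target, list of length 1 holding source).
    static List<Map.Entry<String, List<String>>> map(String source, List<String> targets) {
        List<Map.Entry<String, List<String>>> out = new ArrayList<>();
        for (String target : targets) {
            out.add(new AbstractMap.SimpleEntry<>(target, Collections.singletonList(source)));
        }
        return out;
    }

    // "combine" and "reduce": concatenate the lists seen for one target.
    // Input and output values are both List<String>, so the types never change.
    static List<String> reduce(String target, List<List<String>> values) {
        List<String> all = new ArrayList<>();
        for (List<String> v : values) {
            all.addAll(v);
        }
        return all;
    }
}
```

The cost of this approach is that every map output carries a list wrapper even though it always holds exactly one element.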

> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose  key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-115
>          URL: http://issues.apache.org/jira/browse/HADOOP-115
>      Project: Hadoop
>         Type: Improvement
>   Components: mapred
>     Reporter: Runping Qi


[jira] Updated: (HADOOP-115) permit reduce input types to differ from reduce output types

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]

Doug Cutting updated HADOOP-115:
--------------------------------

        Summary: permit reduce input types to differ from reduce output types  (was: Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose  key/value classes that are different from those for map output.)
           type: New Feature  (was: Improvement)
    Description: 
When map tasks write intermediate data out, they always use a SequenceFile RecordWriter with key/value classes from the job object.

When the reducers write the final results out, their output format is obtained from the job object. By default, it is TextOutputFormat, and there are no conflicts.
However, if one wants to use SequenceFileOutputFormat for the final results, then the key/value classes are also obtained from the job object, the same as for the map tasks' output. Now we have a problem: it is impossible for the map outputs and reducer outputs to use different key/value classes if one wants the reducers to generate outputs in SequenceFileOutputFormat.

A simple fix would be to add another two attributes to the JobConf class: mapOutputKeyClass and mapOutputValueClass. That allows the user to have different key/value classes for the intermediate and final outputs.



  was:

When map tasks write intermediate data out, they always use a SequenceFile RecordWriter with key/value classes from the job object.

When the reducers write the final results out, their output format is obtained from the job object. By default, it is TextOutputFormat, and there are no conflicts.
However, if one wants to use SequenceFileOutputFormat for the final results, then the key/value classes are also obtained from the job object, the same as for the map tasks' output. Now we have a problem: it is impossible for the map outputs and reducer outputs to use different key/value classes if one wants the reducers to generate outputs in SequenceFileOutputFormat.

A simple fix would be to add another two attributes to the JobConf class: mapOutputKeyClass and mapOutputValueClass. That allows the user to have different key/value classes for the intermediate and final outputs.




> permit reduce input types to differ from reduce output types
> ------------------------------------------------------------
>
>          Key: HADOOP-115
>          URL: http://issues.apache.org/jira/browse/HADOOP-115
>      Project: Hadoop
>         Type: New Feature

>   Components: mapred
>     Reporter: Runping Qi
>     Assignee: Runping Qi
>  Attachments: hadoop-115_ReduceTask.patch, hadoop-115_tk.patch, patch_115.txt.2006_05_16

[jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "Teppo Kurki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12372801 ] 

Teppo Kurki commented on HADOOP-115:
------------------------------------

" I.e., the input keys and values are drawn from a different
  domain than the output keys and values. Furthermore,
  the intermediate keys and values are from the same domain
  as the output keys and values. "

However, the Google MapReduce paper also gives the Reverse Web-Link Graph example, where this is not the case:

"Reverse Web-Link Graph: The map function outputs
(target; source) pairs for each link to a target
URL found in a page named source. The reduce
function concatenates the list of all source URLs associated
with a given target URL and emits the pair:
(target; list(source))".



> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose  key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-115
>          URL: http://issues.apache.org/jira/browse/HADOOP-115
>      Project: Hadoop
>         Type: Improvement
>   Components: mapred
>     Reporter: Runping Qi


[jira] Assigned: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]

Owen O'Malley reassigned HADOOP-115:
------------------------------------

    Assign To: Runping Qi

Runping is going to take a look at fixing the patch.

> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose  key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-115
>          URL: http://issues.apache.org/jira/browse/HADOOP-115
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Reporter: Runping Qi
>     Assignee: Runping Qi
>  Attachments: hadoop-115_ReduceTask.patch, hadoop-115_tk.patch

[jira] Updated: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]

Runping Qi updated HADOOP-115:
------------------------------

    Attachment: patch_115.txt.2006_05_16

My patch is attached.

 

Change highlights:

JobConf.java:
    Added set/getMapOutputKey/ValueClass methods
    Modified getOutputKeyComparatorClass 

MapTask.java:
    call getMapOutputKey/ValueClass instead of getOutputKey/ValueClass

MapReduce.java:
    call getMapOutputKey/ValueClass instead of getOutputKey/ValueClass

MapOutputFile.java:
    call getMapOutputKey/ValueClass instead of getOutputKey/ValueClass

 

I've run a job with a combiner, with UTF8 as the map output value class and CrawledDoc as the output value class.

The job completed successfully.



> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose  key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-115
>          URL: http://issues.apache.org/jira/browse/HADOOP-115
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Reporter: Runping Qi
>     Assignee: Runping Qi
>  Attachments: hadoop-115_ReduceTask.patch, hadoop-115_tk.patch, patch_115.txt.2006_05_16

[jira] Reopened: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]
     
Owen O'Malley reopened HADOOP-115:
----------------------------------


Let's reopen this. I've had discussions with Runping today, and it seems to me that:

1. It is basically free with respect to the framework.
2. It allows more applications to be written using the framework rather than working around the framework.
3. It is less clear that we should allow the user to change the key type in the reduce, but since the current API does allow them to change the value (if not the type), I think we should be consistent and allow a type change too.

I propose:

1. Add {set,get}MapOutput{Key,Value}Class functions in JobConf.
2. The default values for getMapOutput{Key,Value}Class are the values from getOutput{Key,Value}Class.
3. Always check the types in the output collector rather than in the OutputFormat, so that even text output files are checked for type correctness.

We should include a javadoc comment for setMapOutputKeyClass that warns that changing the key in the reduce means your output is NOT sorted.
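
Items 1 and 2 of the proposal can be sketched as follows. This is not the committed patch: SketchJobConf is a hypothetical stand-in for JobConf, and the property names used as map keys are made up for the example. It only shows the fallback rule, where the map output class defaults to the final output class unless it is set explicitly.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for JobConf, for illustration only.
class SketchJobConf {
    // The property names used as keys here are invented for this sketch.
    private final Map<String, Class<?>> classes = new HashMap<>();

    void setOutputKeyClass(Class<?> c) { classes.put("output.key.class", c); }
    void setOutputValueClass(Class<?> c) { classes.put("output.value.class", c); }
    void setMapOutputKeyClass(Class<?> c) { classes.put("map.output.key.class", c); }
    void setMapOutputValueClass(Class<?> c) { classes.put("map.output.value.class", c); }

    Class<?> getOutputKeyClass() { return classes.get("output.key.class"); }
    Class<?> getOutputValueClass() { return classes.get("output.value.class"); }

    // Proposal item 2: fall back to the final output class when no map
    // output class has been set.
    Class<?> getMapOutputKeyClass() {
        Class<?> c = classes.get("map.output.key.class");
        return c != null ? c : getOutputKeyClass();
    }

    Class<?> getMapOutputValueClass() {
        Class<?> c = classes.get("map.output.value.class");
        return c != null ? c : getOutputValueClass();
    }
}

class SketchJobConfDemo {
    public static void main(String[] args) {
        SketchJobConf conf = new SketchJobConf();
        conf.setOutputKeyClass(String.class);
        conf.setOutputValueClass(Long.class);
        // A job that sets nothing else keeps map output types equal to the final types.
        System.out.println(conf.getMapOutputValueClass().getName());
        // A job that opts in gets a distinct intermediate value type.
        conf.setMapOutputValueClass(Integer.class);
        System.out.println(conf.getMapOutputValueClass().getName());
    }
}
```

Because the getters fall back rather than fail, a job written before the change would behave exactly as it did, which is what makes the proposal compatible with existing code.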

> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose  key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-115
>          URL: http://issues.apache.org/jira/browse/HADOOP-115
>      Project: Hadoop
>         Type: Improvement
>   Components: mapred
>     Reporter: Runping Qi


[jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12372967 ] 

Owen O'Malley commented on HADOOP-115:
--------------------------------------

I should have addressed the combiner before. *smile* Of course the combiner input and output have to match the map output types. So, it looks like:

map: k1,v1 -> seq(k2,v2)
combine: k2,seq(v2) -> seq(k2,v2)
reduce: k2, seq(v2) -> seq(k3,v3)

So the only extra code is to set/get the types for k2/v2 (or equivalent k3/v3), although I would recommend adding a type check in the reduce collector. It is completely upward compatible.

As for user confusion, I've already had to explain this restriction (k2==k3 and v2==v3) far more times than I'd like.

On a side note, we could hack around the problem by defining an OutputFormat that uses SequenceFileWriter, but doesn't open the file until the first key/value pair is written and takes the types from the first instances. But that breaks when someone puts the type check into the reduce collector.
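
A type check in a collector, as recommended above, might look roughly like this. CheckedCollector is a hypothetical class written for this sketch, not Hadoop's OutputCollector, and the exact-class comparison (rather than accepting subclasses) is an assumption. One instance configured with the k2/v2 classes would guard map and combine output; another configured with k3/v3 would guard reduce output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical collector, for illustration only.
class CheckedCollector {
    private final Class<?> keyClass;
    private final Class<?> valueClass;
    private final List<Object[]> pairs = new ArrayList<>();

    CheckedCollector(Class<?> keyClass, Class<?> valueClass) {
        this.keyClass = keyClass;
        this.valueClass = valueClass;
    }

    // Rejects any pair whose runtime classes differ from the declared ones.
    // Assumes keys and values are never null.
    void collect(Object key, Object value) {
        if (key.getClass() != keyClass) {
            throw new IllegalArgumentException(
                "key is " + key.getClass().getName() + ", expected " + keyClass.getName());
        }
        if (value.getClass() != valueClass) {
            throw new IllegalArgumentException(
                "value is " + value.getClass().getName() + ", expected " + valueClass.getName());
        }
        pairs.add(new Object[] {key, value});
    }

    int size() {
        return pairs.size();
    }
}
```

Checking in the collector rather than in the output format means a mismatch is caught at the first bad pair regardless of which output format the job uses.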

> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose  key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-115
>          URL: http://issues.apache.org/jira/browse/HADOOP-115
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Reporter: Runping Qi


[jira] Updated: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "Teppo Kurki (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]

Teppo Kurki updated HADOOP-115:
-------------------------------

    Attachment: hadoop-115_ReduceTask.patch

Patch including

TestReduceTask
- generates a bunch of SequenceFiles and reduces them by running a single ReduceTask
- two test methods, one where input is just copied to output and one where the Reducer swaps keys and values
- Reducer checks that all generated key-value pairs are reduced by key
- checks that the resulting output file contains what it's supposed to

JobConf
- the necessary set/getMapOutputKey/ValueClass methods
- getOutputComparator uses MapKeyClass if one is specified

ReduceTask
- append and sort phases get the classes from getMapOutput.. methods

This should take care of the Reduce part of the problem. MapTask should also be adjusted accordingly, but since I haven't written a test for that, I haven't done it yet.

Owen, I didn't get your comment on handling the combiners - doesn't the combiner just use the map OutputCollector underneath and as you put it 

map: k1,v1 -> seq(k2,v2)
combine: k2,seq(v2) -> seq(k2,v2) 

the outputs are exactly the same, even if the combiner is technically a Reducer?







> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose  key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-115
>          URL: http://issues.apache.org/jira/browse/HADOOP-115
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Reporter: Runping Qi
>  Attachments: hadoop-115_ReduceTask.patch, hadoop-115_tk.patch

[jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "Teppo Kurki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12372790 ] 

Teppo Kurki commented on HADOOP-115:
------------------------------------

+1

My original post about the issue gives a simple case that would benefit from this: http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00073.html

As already said, this would add more transformational power to Hadoop and make certain cases more straightforward.





[jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.

Posted by "Darek Zbik (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12372700 ] 

Darek Zbik commented on HADOOP-115:
-----------------------------------

The new attributes in JobConf should denote the output of the reduce task, not the output of the map task as the name mapOutputValueClass suggests.
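
Either naming gives the same separation between intermediate and final key/value classes. As a minimal sketch of the variant proposed in the issue description, the class below keeps the existing job-wide output classes and adds separate map-output classes. JobConfSketch is a hypothetical stand-in written for illustration, not the real JobConf, and the fallback to the job-wide classes when no map-output class is set is an assumption made here so that existing jobs would keep working unchanged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for JobConf; names and behavior are assumptions.
public class JobConfSketch {
    private final Map<String, Class<?>> classes = new HashMap<>();

    // Classes of the final (reduce) output, as the job object has today.
    public void setOutputKeyClass(Class<?> c) { classes.put("output.key", c); }
    public void setOutputValueClass(Class<?> c) { classes.put("output.value", c); }
    public Class<?> getOutputKeyClass() { return classes.get("output.key"); }
    public Class<?> getOutputValueClass() { return classes.get("output.value"); }

    // Proposed additional attributes for the intermediate (map) output.
    public void setMapOutputKeyClass(Class<?> c) { classes.put("map.output.key", c); }
    public void setMapOutputValueClass(Class<?> c) { classes.put("map.output.value", c); }

    // Assumed fallback: use the final output class when none is set.
    public Class<?> getMapOutputKeyClass() {
        return classes.getOrDefault("map.output.key", getOutputKeyClass());
    }

    public Class<?> getMapOutputValueClass() {
        return classes.getOrDefault("map.output.value", getOutputValueClass());
    }

    public static void main(String[] args) {
        JobConfSketch conf = new JobConfSketch();
        conf.setOutputKeyClass(String.class);
        conf.setOutputValueClass(Long.class);
        // Only the intermediate value class differs from the final one.
        conf.setMapOutputValueClass(Integer.class);

        System.out.println("map key:      " + conf.getMapOutputKeyClass().getName());
        System.out.println("map value:    " + conf.getMapOutputValueClass().getName());
        System.out.println("reduce key:   " + conf.getOutputKeyClass().getName());
        System.out.println("reduce value: " + conf.getOutputValueClass().getName());
    }
}
```

With the naming suggested in this comment, the roles would be reversed: the existing attributes would describe the map output and the new ones the reduce output.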

