Posted to common-dev@hadoop.apache.org by "Runping Qi (JIRA)" <ji...@apache.org> on 2006/03/31 17:31:40 UTC
[jira] Created: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Key: HADOOP-115
URL: http://issues.apache.org/jira/browse/HADOOP-115
Project: Hadoop
Type: Improvement
Components: mapred
Reporter: Runping Qi
When map tasks write intermediate data out, they always use the SequenceFile RecordWriter with key/value classes from the job object.
When the reducers write the final results out, their output format is obtained from the job object. By default it is TextOutputFormat, so there is no conflict.
However, if one wants to use SequenceFileOutputFormat for the final results, then the key/value classes are also obtained from the job object, the same as for the map tasks' output. Now we have a problem: it is impossible for the map outputs and reducer outputs to use different key/value classes if one wants the reducers to generate outputs in SequenceFileOutputFormat.
A simple fix would be to add another two attributes to the JobConf class: mapOutputKeyClass and mapOutputValueClass. That allows the user to have different key/value classes for the intermediate and final outputs.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to
use SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by Doug Cutting <cu...@apache.org>.
Konstantin Shvachko wrote:
> Here is another example, that I dealt with.
> I wanted to use different value types (long, float or string) for both
> map and reduce tasks,
> depending on the actual key values. So the solution was to encode the
> value type into the key value.
> I used keys of the form
> l:<name> - indicating the value type is expected to be long
> f:<name> - value type is expected to be float
> s:<name> - value is a string
> The example is under HADOOP-95.
> Thought somebody might find it practical.
On a related note, ObjectWritable can be used as input or output type,
and can wrap any Writable class, thus permitting polymorphic inputs and
outputs. Nutch uses this to, e.g., combine a URL's incoming anchor
texts and its content when indexing. The input type is ObjectWritable,
and the indexer's InputFormat wraps values from a variety of files. The
indexing reducer can then use the 'instanceof' operator to determine how
to process each input value. To be more object-oriented, one could have
all of these classes implement some Indexable interface whose methods
are invoked when reducing.
Doug
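The two dispatch styles mentioned in the message above can be sketched in plain Java. This is only an illustration under stated assumptions: `AnchorText`, `PageContent` and `Indexable` are hypothetical stand-ins, not Nutch or Hadoop classes, and a plain `Object` takes the place of a value unwrapped from ObjectWritable.

```java
import java.util.List;

class PolymorphicValues {
    /** Object-oriented alternative: each value type knows how to index itself. */
    interface Indexable {
        void addTo(StringBuilder index);
    }

    /** Hypothetical value type standing in for incoming anchor text. */
    static final class AnchorText implements Indexable {
        final String text;
        AnchorText(String text) { this.text = text; }
        @Override public void addTo(StringBuilder index) {
            index.append("anchor:").append(text).append(';');
        }
    }

    /** Hypothetical value type standing in for page content. */
    static final class PageContent implements Indexable {
        final String body;
        PageContent(String body) { this.body = body; }
        @Override public void addTo(StringBuilder index) {
            index.append("content:").append(body).append(';');
        }
    }

    /** Reducer-style processing that inspects each value with instanceof. */
    static String indexWithInstanceof(List<Object> values) {
        StringBuilder index = new StringBuilder();
        for (Object v : values) {
            if (v instanceof AnchorText) {
                index.append("anchor:").append(((AnchorText) v).text).append(';');
            } else if (v instanceof PageContent) {
                index.append("content:").append(((PageContent) v).body).append(';');
            } else {
                throw new IllegalArgumentException("unexpected value type: " + v.getClass());
            }
        }
        return index.toString();
    }

    /** The same processing expressed through the shared interface. */
    static String indexWithInterface(List<? extends Indexable> values) {
        StringBuilder index = new StringBuilder();
        for (Indexable v : values) {
            v.addTo(index);
        }
        return index.toString();
    }
}
```

Both methods build the same string for the same values; the interface version simply moves the per-type logic into the value classes, so adding a new value type does not require touching the reducer.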
Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to
use SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
I agree that the framework must be as general as possible, which means
one should use some simple data structure for keys and values, like
string or BytesWritable. Also, nothing prevents us from implementing
other types on top of the framework as an optional layer of
higher-level API.
Here is another example that I dealt with.
I wanted to use different value types (long, float or string) for both
map and reduce tasks, depending on the actual key values. So the
solution was to encode the value type into the key value.
I used keys of the form
l:<name> - indicating the value type is expected to be long
f:<name> - value type is expected to be float
s:<name> - value is a string
The example is under HADOOP-95.
Thought somebody might find it practical.
--Konstantin
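A minimal sketch of the key-prefix convention described above, assuming values travel as strings and are parsed on the receiving side. The class and method names (`TypedKeys`, `parseValue`, `name`) are hypothetical; the actual example is the one attached to HADOOP-95, which is not reproduced here.

```java
class TypedKeys {
    /** Parses a raw string value according to the type tag at the front of the key. */
    static Object parseValue(String key, String rawValue) {
        if (key.length() < 2 || key.charAt(1) != ':') {
            throw new IllegalArgumentException("key has no type prefix: " + key);
        }
        switch (key.charAt(0)) {
            case 'l': return Long.parseLong(rawValue);   // l:<name> -> long
            case 'f': return Float.parseFloat(rawValue); // f:<name> -> float
            case 's': return rawValue;                   // s:<name> -> string
            default:
                throw new IllegalArgumentException("unknown type prefix in key: " + key);
        }
    }

    /** Returns the <name> part of a key of the form <tag>:<name>. */
    static String name(String key) {
        return key.substring(2);
    }
}
```

For example, `TypedKeys.parseValue("l:count", "42")` yields a `Long`, while the same raw text under the key `s:count` stays a `String`.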
Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to
use SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by Doug Cutting <cu...@apache.org>.
Eric Baldeschwieler wrote:
> An observation... this whole thread is about limits caused by type
> safety. Interestingly, the other implementation of map-reduce does not
> support types at all. Everything is a string.
>
> So I agree that our departure from the paper is the problem. ;-)
A corollary is that one could simply use BytesWritable for all one's
keys and values, altering only one's WritableComparator implementation,
and one would not encounter this problem. The use of types in Hadoop is
thus an optional feature. One could even layer a different type system
on top of BytesWritable that exhibits the desired properties.
> I'm comfortable letting this lie for a while. But I predict we've not
> heard the last of it.
Owen seems to be picking it up, which is fine by me.
Doug
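As a rough illustration of the "bytes everywhere, vary only the comparator" idea above, the sketch below uses plain `byte[]` in place of BytesWritable and `java.util.Comparator` in place of a WritableComparator. Those substitutions, and the names `ByteKeys`, `AS_LONG` and `AS_TEXT`, are assumptions made so the example runs without Hadoop on the classpath.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Comparator;

class ByteKeys {
    /** Encodes a long as 8 big-endian bytes. */
    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    /** Decodes 8 big-endian bytes back into a long. */
    static long decodeLong(byte[] b) {
        return ByteBuffer.wrap(b).getLong();
    }

    /** Orders raw keys by the signed long they encode. */
    static final Comparator<byte[]> AS_LONG =
        (a, b) -> Long.compare(decodeLong(a), decodeLong(b));

    /** Orders raw keys as UTF-8 text. */
    static final Comparator<byte[]> AS_TEXT =
        (a, b) -> new String(a, StandardCharsets.UTF_8)
            .compareTo(new String(b, StandardCharsets.UTF_8));
}
```

The stored data is untyped in both cases; the "type" lives entirely in which comparator (and which decoder) the job chooses to apply.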
Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
An observation... this whole thread is about limits caused by type
safety. Interestingly, the other implementation of map-reduce does
not support types at all. Everything is a string.
So I agree that our departure from the paper is the problem. ;-)
I'm comfortable letting this lie for a while. But I predict we've
not heard the last of it.
On Apr 2, 2006, at 10:29 PM, Doug Cutting wrote:
> Runping Qi wrote:
>> The argument about using local combiners is interesting. To me, a
>> combiner class is just another layer of transformer. That does not
>> mean that the combiner class has to be the same as the reducer
>> class. The only criterion is that they satisfy the associativity
>> rule: Let L1, L2, ..., Ln and K1, K2, ..., Km be two partitions of
>> S; then Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln)))
>> and Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km))) are
>> the same.
>> A special (maybe very common) scenario is that the combiner and
>> reducer are the same class and the reduce function is associative.
>> However, this need not be the case in general. And the class of the
>> reduce outputs need not be the same as that of the combiner outputs,
>> if the combiner and the reducer are not the same class.
>
> This may indeed be an intriguing generalization of the MapReduce
> model. But it does add more possible failure modes. At present we
> have far too few unit tests for the existing, simpler MapReduce
> model, and the platform is still shaky. Thus I am reluctant to
> spend a lot of time extending the model in ways that are not
> absolutely essential.
>
> My goal is for Hadoop to be widely used. I do not feel that the
> power of the MapReduce model is currently a primary bottleneck to
> wider adoption. The larger issues we face are performance,
> reliability, scalability and documentation.
>
> If I am to commit a patch, then I must feel that I can support and
> maintain it, that it fits within my priorities. Otherwise, if it
> causes problems that I don't have time to attend to (even if this
> only means reviewing and testing fixes submitted by others) then
> the quality of the system will decrease, a vector we must avoid.
>
> Currently we have just four committers on Hadoop. For Mike and
> Andrzej, Nutch is a secondary effort. Owen has been voted in as a
> Hadoop committer, but his paperwork is not yet complete. So I am
> the bottleneck. I spend a lot of time on annoying yet critical
> issues like making sure that recent extensions to Hadoop don't
> break Nutch running in pseudo-distributed mode on Windows.
>
> I don't particularly like things this way, but that's where we are
> right now. The best way to get out of here is for folks who'd like
> to be committers to submit high-quality, well documented, well-
> formatted, non-disruptive, unit-test-bearing patches that are easy
> for me to apply and make Hadoop easier to use and more reliable,
> thus earning points towards becoming committers. If we have more
> committers then we should be able to advance with confidence on
> more fronts in parallel.
>
> Doug
Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to
use SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by Doug Cutting <cu...@apache.org>.
Runping Qi wrote:
> The argument about using local combiners is interesting. To me, a combiner class
> is just another layer of transformer. That does not mean that the combiner
> class has to be the same as the reducer class. The only criterion is that
> they satisfy the associativity rule:
> Let L1, L2, ..., Ln and K1, K2, ..., Km be two partitions of S; then
> Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln))) and
> Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km))) are the
> same.
>
> A special (maybe very common) scenario is that the combiner and reducer are the
> same class and the reduce function is associative. However, this need not be
> the case in general. And the class of the reduce outputs need not be the
> same as that of the combiner outputs, if the combiner and the reducer are not the
> same class.
This may indeed be an intriguing generalization of the MapReduce
model. But it does add more possible failure modes. At present we have
far too few unit tests for the existing, simpler MapReduce model, and
the platform is still shaky. Thus I am reluctant to spend a lot of
time extending the model in ways that are not absolutely essential.
My goal is for Hadoop to be widely used. I do not feel that the power
of the MapReduce model is currently a primary bottleneck to wider
adoption. The larger issues we face are performance, reliability,
scalability and documentation.
If I am to commit a patch, then I must feel that I can support and
maintain it, that it fits within my priorities. Otherwise, if it causes
problems that I don't have time to attend to (even if this only means
reviewing and testing fixes submitted by others) then the quality of the
system will decrease, a vector we must avoid.
Currently we have just four committers on Hadoop. For Mike and Andrzej,
Nutch is a secondary effort. Owen has been voted in as a Hadoop
committer, but his paperwork is not yet complete. So I am the
bottleneck. I spend a lot of time on annoying yet critical issues like
making sure that recent extensions to Hadoop don't break Nutch running
in pseudo-distributed mode on Windows.
I don't particularly like things this way, but that's where we are right
now. The best way to get out of here is for folks who'd like to be
committers to submit high-quality, well documented, well-formatted,
non-disruptive, unit-test-bearing patches that are easy for me to apply
and make Hadoop easier to use and more reliable, thus earning points
towards becoming committers. If we have more committers then we should
be able to advance with confidence on more fronts in parallel.
Doug
RE: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
Posted by Runping Qi <ru...@yahoo-inc.com>.
The argument about using local combiners is interesting. To me, a combiner class
is just another layer of transformer. That does not mean that the combiner
class has to be the same as the reducer class. The only criterion is that
they satisfy the associativity rule:
Let L1, L2, ..., Ln and K1, K2, ..., Km be two partitions of S; then
Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln))) and
Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km))) are the
same.
A special (maybe very common) scenario is that the combiner and reducer are the
same class and the reduce function is associative. However, this need not be
the case in general. And the class of the reduce outputs need not be the
same as that of the combiner outputs, if the combiner and the reducer are not the
same class.
Runping
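The rule stated above can be checked on a small example. The sketch below is one possible instance, not anything from the Hadoop code base: the combiner emits a hypothetical `Partial` (sum, count) pair, the reducer merges those pairs into a mean, and the result is the same however the input S is partitioned. It follows the formulation in the message, in which the reducer always sees combiner output.

```java
import java.util.ArrayList;
import java.util.List;

class CombinerRule {
    /** Partial aggregate emitted by the combiner: a (sum, count) pair. */
    static final class Partial {
        final long sum;
        final long count;
        Partial(long sum, long count) { this.sum = sum; this.count = count; }
    }

    /** Combiner: collapses one block of raw values into a single partial aggregate. */
    static Partial combine(List<Long> block) {
        long sum = 0;
        for (long v : block) {
            sum += v;
        }
        return new Partial(sum, block.size());
    }

    /** Reducer: merges partial aggregates and emits a mean, a different type from its input. */
    static double reduce(List<Partial> partials) {
        long sum = 0;
        long count = 0;
        for (Partial p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return (double) sum / count;
    }

    /** Computes Reduce(list(Combiner(B1), ..., Combiner(Bn))) for one partition of S. */
    static double run(List<List<Long>> partition) {
        List<Partial> partials = new ArrayList<>();
        for (List<Long> block : partition) {
            partials.add(combine(block));
        }
        return reduce(partials);
    }
}
```

Here the combiner and reducer are different functions with different output types (`Partial` versus `double`), yet splitting {1, 2, 3, 4, 5} as {1, 2}, {3, 4, 5} or as {1}, {2, 3}, {4, 5} gives the same mean.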
-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org]
Sent: Sunday, April 02, 2006 1:30 PM
To: hadoop-dev@lucene.apache.org
Subject: Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to
use SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Eric Baldeschwieler wrote:
> I can not think of a case where this proposed extension complicates
> code or reduces compressibility. Since it is backwards compatible with
> your desired API, purists can simply ignore the option.
It makes the insertion of a combiner no longer transparent. The reducer
would have to know whether a combiner had been used in order to know how
to process the map output.
In general this seems like a micro-optimization. It saves little code.
Instead of writing 'collector.collect(key, new List(value))' one could
write 'collector.collect(key, value)'.
Taking this to its logical extreme, in the classic word-count use of
MapReduce, why should one have to emit ones for the map values? Why
have a value at all? Why not add a collect(key) method, then permit
reducers to be passed an iterator which returns null for all values
where collect(key) was called. That would save a little code and make
the intermediate data a bit smaller. So should we do it? I'd argue not.
Doug
Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to
use SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by Doug Cutting <cu...@apache.org>.
Eric Baldeschwieler wrote:
> I can not think of a case where this proposed extension complicates
> code or reduces compressibility. Since it is backwards compatible with
> your desired API, purists can simply ignore the option.
It makes the insertion of a combiner no longer transparent. The reducer
would have to know whether a combiner had been used in order to know how
to process the map output.
In general this seems like a micro-optimization. It saves little code.
Instead of writing 'collector.collect(key, new List(value))' one could
write 'collector.collect(key, value)'.
Taking this to its logical extreme, in the classic word-count use of
MapReduce, why should one have to emit ones for the map values? Why
have a value at all? Why not add a collect(key) method, then permit
reducers to be passed an iterator which returns null for all values
where collect(key) was called. That would save a little code and make
the intermediate data a bit smaller. So should we do it? I'd argue not.
Doug
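To illustrate what "transparent" means for a combiner, here is a small in-memory word count, written as a sketch rather than against the Hadoop API. The class `WordCount` and its `run` method are hypothetical; each input string stands in for one map task's split, and the same `sum` function serves as both combiner and reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCount {
    /** Reduce function, also usable as the combiner: sums the counts for one word. */
    static int sum(List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    /** Runs map, optional combine, and reduce over the given splits. */
    static Map<String, Integer> run(List<String> splits, boolean useCombiner) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String split : splits) {
            // Map: emit (word, 1) for every word in this split.
            Map<String, List<Integer>> local = new TreeMap<>();
            for (String word : split.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                local.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
            // Optional combine: pre-sum the ones locally before the shuffle.
            for (Map.Entry<String, List<Integer>> e : local.entrySet()) {
                List<Integer> target =
                    shuffled.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
                if (useCombiner) {
                    target.add(sum(e.getValue()));
                } else {
                    target.addAll(e.getValue());
                }
            }
        }
        // Reduce: the reducer does not know or care whether the combiner ran.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            result.put(e.getKey(), sum(e.getValue()));
        }
        return result;
    }
}
```

Because the combiner's output has the same type as the map output, `run(splits, true)` and `run(splits, false)` return identical counts; that property is lost if the combiner is allowed to change the value type.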
Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
Sure, there is something wrong with requiring extra map-reduce
passes. Without significant development it can be very expensive
(shuffling, sorting and rewriting your whole output set can be a
significant burden). Pointlessly so, since the extension is clear,
safe and easier to explain than the restriction.
I think we can all agree that a project goal is to keep the design as
simple and focused as possible. I'd find an argument against an
extension based on those goals pretty compelling, but the lack of a
feature in a paper from Google doesn't seem like a compelling reason
to reject something. The Hadoop approach to many decisions varies
from Google's; this is not a bad thing.
I can not think of a case where this proposed extension complicates
code or reduces compressibility. Since it is backwards compatible
with your desired API, purists can simply ignore the option.
On Apr 1, 2006, at 9:29 AM, Andrew McNabb wrote:
> On Sat, Apr 01, 2006 at 06:19:27PM +0100, Teppo Kurki (JIRA) wrote:
>>
>> My original post about the issue gives a simple case that would
>> benefit from this:
>> http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00073.html
>>
>
> This should be done in two map-reduce phases. There's nothing wrong
> with running two phases (or 10,000).
>
> --
> Andrew McNabb
> http://www.mcnabbs.org/andrew/
> PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868
Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
Posted by Andrew McNabb <am...@mcnabbs.org>.
On Sat, Apr 01, 2006 at 06:19:27PM +0100, Teppo Kurki (JIRA) wrote:
>
> My original post about the issue gives a simple case that would benefit from this: http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00073.html
>
This should be done in two map-reduce phases. There's nothing wrong
with running two phases (or 10,000).
--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868
Re: [jira] Reopened: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
Posted by Andrew McNabb <am...@mcnabbs.org>.
On Sat, Apr 01, 2006 at 09:21:26AM +0100, Owen O'Malley (JIRA) wrote:
>
> Let's reopen this. I've had discussions with Runping today, and it seems to me that:
> 3. It is less clear that we should allow the user to change the key
> type in the reduce, but since the current API does allow them to
> change the value (if not the type), I think we should be consistent
> and allow a type change too.
I strongly disagree. I think it's unnecessary, and I think it breaks
the model too much. Google has many thousands of map reduce
applications, and they haven't broken the model yet. I don't really
care about output formats, but I think we're just asking for trouble if
we allow the type of the reduce output to be different from the type of
the map output.
--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868
Re: [jira] Resolved: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
Posted by Andrew McNabb <am...@mcnabbs.org>.
On Fri, Mar 31, 2006 at 06:17:48PM +0100, Doug Cutting (JIRA) wrote:
>
> This is the way it is supposed to work. From the MapReduce paper:
>
> map (k1,v1) → list(k2,v2)
> reduce (k2,list(v2)) → list(v2)
>
> I.e., the input keys and values are drawn from a different
> domain than the output keys and values. Furthermore,
> the intermediate keys and values are from the same domain
> as the output keys and values.
>
> I am closing this bug. If someone feels strongly that we should extend the MapReduce model in this direction, then we can re-open it. But, as it stands, things work as intended.
>
I agree with you. If you need to change keys, run a second map-reduce
phase.
--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868
[jira] Commented: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "eric baldeschwieler (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12372783 ]
eric baldeschwieler commented on HADOOP-115:
--------------------------------------------
Ah! Are you suggesting that getOutput* describes the final classes output from reduce always and if you don't set the new variables MapOutput* it also controls the map? That is clear enough and backwards compatible. I just was not looking at it the right way!
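The fallback behaviour described in this comment can be sketched as follows. `JobConfSketch` is a hypothetical stand-in written for illustration, not Hadoop's JobConf, and the method and setting names are assumptions; the only point shown is that the map-output getters fall back to the final output classes when nothing map-specific has been set.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a job configuration; not Hadoop's JobConf. */
class JobConfSketch {
    private final Map<String, Class<?>> settings = new HashMap<>();

    void setOutputKeyClass(Class<?> c) { settings.put("output.key", c); }
    void setOutputValueClass(Class<?> c) { settings.put("output.value", c); }
    void setMapOutputKeyClass(Class<?> c) { settings.put("map.output.key", c); }
    void setMapOutputValueClass(Class<?> c) { settings.put("map.output.value", c); }

    /** Classes of the final (reduce) output. */
    Class<?> getOutputKeyClass() { return settings.get("output.key"); }
    Class<?> getOutputValueClass() { return settings.get("output.value"); }

    /** Falls back to the final output key class when no map-specific class is set. */
    Class<?> getMapOutputKeyClass() {
        return settings.getOrDefault("map.output.key", getOutputKeyClass());
    }

    /** Falls back to the final output value class when no map-specific class is set. */
    Class<?> getMapOutputValueClass() {
        return settings.getOrDefault("map.output.value", getOutputValueClass());
    }
}
```

A job that never calls the map-specific setters behaves exactly as before, which is what makes the change backwards compatible.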
[jira] Resolved: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]
Doug Cutting resolved HADOOP-115:
---------------------------------
Resolution: Won't Fix
This is the way it is supposed to work. From the MapReduce paper:
map (k1,v1) → list(k2,v2)
reduce (k2,list(v2)) → list(v2)
I.e., the input keys and values are drawn from a different
domain than the output keys and values. Furthermore,
the intermediate keys and values are from the same domain
as the output keys and values.
I am closing this bug. If someone feels strongly that we should extend the MapReduce model in this direction, then we can re-open it. But, as it stands, things work as intended.
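To make the type constraint concrete, here is a minimal Java sketch of the two reduce signatures under discussion. The interface names `PaperReducer` and `ExtendedReducer` are hypothetical and are not part of Hadoop's API; they only show that in the paper's model reduce emits values of the intermediate type, whereas the requested extension lets reduce emit pairs of a different type.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ReduceSignatures {
    /** Paper model: reduce (k2, list(v2)) -> list(v2). */
    interface PaperReducer<K2, V2> {
        List<V2> reduce(K2 key, List<V2> values);
    }

    /** Requested extension: reduce (k2, list(v2)) -> list(k3, v3). */
    interface ExtendedReducer<K2, V2, K3, V3> {
        List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
    }

    /** Sums the values; the output type equals the intermediate value type. */
    static final PaperReducer<String, Long> SUM = (key, values) -> {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        List<Long> out = new ArrayList<>();
        out.add(total);
        return out;
    };

    /** Emits an Integer key and a String value, both different from the input types. */
    static final ExtendedReducer<String, Long, Integer, String> LENGTH_KEYED = (key, values) -> {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        List<Map.Entry<Integer, String>> out = new ArrayList<>();
        out.add(new SimpleEntry<Integer, String>(key.length(), key + "=" + total));
        return out;
    };
}
```

Under the first interface a reducer over `(String, Long)` input can only return `Long` values; the second is the shape that needs separate map-output and final-output classes in the job configuration.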
[jira] Updated: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "Teppo Kurki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]
Teppo Kurki updated HADOOP-115:
-------------------------------
Attachment: hadoop-115_tk.patch
(1) adds JobConf.set/getMapKey/ValueClass
(2) changes in ReduceTask so that the Map phase classes are used in append & sort
(3) explicit key and value class specification in OutputFormat.getRecordWriter so that they are not automatically pulled out of JobConf
Tests run ok, but I haven't added tests for different Map/Reduce key/value classes.
Owen's suggestion on automatically typing the OutputFormat when the first record is written would cause changes in fewer classes, but I had this already working...
[jira] Commented: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "Teppo Kurki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12373780 ]
Teppo Kurki commented on HADOOP-115:
------------------------------------
You're right about the getMapOutputComparatorClass and the needless interface change.
Automatic/deferred SequenceFile typing based on the first record doesn't seem feasible because SequenceFile.Writer.append(byte[]...) is used here and there.
I already had a second go at this with fewer changes, but I'll keep it to myself until I can put some unit tests together.
At least I'm learning about how things work under the hood.
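One way to picture the difficulty with deferred typing is the sketch below. It is an interpretation of the comment above, not the SequenceFile implementation: `DeferredTypingWriter` is a hypothetical class whose typed `append` can learn the key/value classes from the first record, while its raw-bytes `appendRaw` has no objects to inspect and so cannot determine them.

```java
/** Hypothetical writer that tries to infer key/value classes from the first record. */
class DeferredTypingWriter {
    private Class<?> keyClass;
    private Class<?> valueClass;
    private int records;

    /** Typed append: the first record fixes the classes; later records must match. */
    void append(Object key, Object value) {
        if (keyClass == null) {
            keyClass = key.getClass();
            valueClass = value.getClass();
        } else if (key.getClass() != keyClass || value.getClass() != valueClass) {
            throw new IllegalArgumentException("record does not match the inferred classes");
        }
        records++;
    }

    /** Raw append: serialized bytes carry no class information, so nothing can be inferred. */
    void appendRaw(byte[] keyBytes, byte[] valueBytes) {
        if (keyClass == null) {
            throw new IllegalStateException("key/value classes are unknown for raw records");
        }
        records++;
    }

    Class<?> getKeyClass() { return keyClass; }
    Class<?> getValueClass() { return valueClass; }
    int getRecordCount() { return records; }
}
```

If the first write goes through the raw path, the writer has nothing to put in a typed header, which is why passing the classes in explicitly is the simpler option.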
[jira] Resolved: (HADOOP-115) permit reduce input types to differ
from reduce output types
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]
Doug Cutting resolved HADOOP-115:
---------------------------------
Fix Version: 0.3
Resolution: Fixed
Your patch has some strange formatting, and some spurious whitespace changes. I fixed most of these. It would also be best to have a unit test that uses this feature. But I'm tired of this issue, and it should do no harm, so I committed it.
> permit reduce input types to differ from reduce output types
> ------------------------------------------------------------
>
> Key: HADOOP-115
> URL: http://issues.apache.org/jira/browse/HADOOP-115
> Project: Hadoop
> Type: New Feature
> Components: mapred
> Reporter: Runping Qi
> Assignee: Runping Qi
> Fix For: 0.3
> Attachments: hadoop-115_ReduceTask.patch, hadoop-115_tk.patch, patch_115.txt.2006_05_16
>
> When map tasks write intermediate data out, they always use SequencialFile RecordWriter with key/value classes from the job object.
> When the reducers write the final results out, its output format is obtained from the job object. By default, it is TextOutputFormat, and no conflicts.
> However, if one wants to use SequencialFileFormat for the final results, then the key/value classes are also obtained from the job object, the same as the map tasks' output. Now we have a problem. It is impossible for the map outputs and reducer outputs use different key/value classes, if one wants the reducers generate outputs in SequentialFileFormat.
> A simple fix would be to add another two attributes to JobConf class: mapOutputLeyClass and mapOutputValueClass. That allows the user to have different key/value classes for the intermediate and final outputs.
[jira] Closed: (HADOOP-115) permit reduce input types to differ
from reduce output types
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]
Doug Cutting closed HADOOP-115:
-------------------------------
> permit reduce input types to differ from reduce output types
> ------------------------------------------------------------
>
> Key: HADOOP-115
> URL: http://issues.apache.org/jira/browse/HADOOP-115
> Project: Hadoop
> Type: New Feature
> Components: mapred
> Reporter: Runping Qi
> Assignee: Runping Qi
> Fix For: 0.3.0
> Attachments: hadoop-115_ReduceTask.patch, hadoop-115_tk.patch, patch_115.txt.2006_05_16
[jira] Commented: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12373714 ]
Owen O'Malley commented on HADOOP-115:
--------------------------------------
Your patch adds {set,get}MapOutputComparatorClass(), which aren't needed, because the map outputs are the only ones that are compared.
I don't think you are handling the combiners.
Did you need to change the interfaces to explicitly pass around the key/value types to the getRecordWriter()? Shouldn't getRecordWriter only be called for the reduce outputs?
> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-115
> URL: http://issues.apache.org/jira/browse/HADOOP-115
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Reporter: Runping Qi
> Attachments: hadoop-115_tk.patch
[jira] Commented: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "eric baldeschwieler (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12372781 ]
eric baldeschwieler commented on HADOOP-115:
--------------------------------------------
+1
But MapOutput seems confusing. Shouldn't these be called FinalOutput or ReduceOutput?
> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-115
> URL: http://issues.apache.org/jira/browse/HADOOP-115
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Reporter: Runping Qi
[jira] Commented: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "Bryan Pendleton (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12372803 ]
Bryan Pendleton commented on HADOOP-115:
----------------------------------------
I've implemented the above case with the existing code, and it's pretty simple... the output is actually always list(source), or perhaps a new type that encapsulates the list, and it just means that the map actually outputs a list of length 1, and the combiner/reducer concatenates lists.
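A minimal sketch of the workaround described in this comment, written in plain Java with no Hadoop dependency. The class and method names (`ListValueWorkaround`, `map`, `concat`, `run`) are hypothetical, and ordinary collections stand in for Hadoop's key/value types. The point it illustrates is that when the map already emits a list of length 1, the map output type and the final output type coincide, so one function can act as both combiner and reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java collections stand in for Hadoop's types.
final class ListValueWorkaround {

    // Map step: emit (target, [source]), i.e. a list of length 1,
    // so the map output value type already equals the final output type.
    static Map.Entry<String, List<String>> map(String source, String target) {
        List<String> single = new ArrayList<>();
        single.add(source);
        return Map.entry(target, single);
    }

    // Combine/reduce step: concatenate the lists seen for one key.
    // Input and output value types are identical, so the same function
    // can serve as both combiner and reducer.
    static List<String> concat(List<List<String>> values) {
        List<String> out = new ArrayList<>();
        for (List<String> v : values) {
            out.addAll(v);
        }
        return out;
    }

    // Tiny driver: group the map outputs by key, then reduce each group.
    static Map<String, List<String>> run(String[][] links) {
        Map<String, List<List<String>>> grouped = new TreeMap<>();
        for (String[] link : links) {
            Map.Entry<String, List<String>> kv = map(link[0], link[1]);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<List<String>>> e : grouped.entrySet()) {
            result.put(e.getKey(), concat(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Each row is a (source, target) link.
        String[][] links = { {"a.html", "x.html"}, {"b.html", "x.html"}, {"a.html", "y.html"} };
        System.out.println(run(links));
    }
}
```

The cost of this approach is that every map output carries a one-element list wrapper, which is the kind of working around the framework that the issue aims to make unnecessary.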
> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-115
> URL: http://issues.apache.org/jira/browse/HADOOP-115
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Reporter: Runping Qi
[jira] Updated: (HADOOP-115) permit reduce input types to differ
from reduce output types
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]
Doug Cutting updated HADOOP-115:
--------------------------------
Summary: permit reduce input types to differ from reduce output types (was: Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.)
type: New Feature (was: Improvement)
Description:
When map tasks write intermediate data out, they always use the SequenceFile RecordWriter with key/value classes from the job object.
When the reducers write the final results out, their output format is obtained from the job object. By default it is TextOutputFormat, so there are no conflicts.
However, if one wants to use SequenceFileOutputFormat for the final results, then the key/value classes are also obtained from the job object, the same as for the map tasks' output. Now we have a problem: it is impossible for the map outputs and reducer outputs to use different key/value classes if one wants the reducers to generate output in SequenceFileOutputFormat.
A simple fix would be to add another two attributes to the JobConf class: mapOutputKeyClass and mapOutputValueClass. That allows the user to have different key/value classes for the intermediate and final outputs.
was:
When map tasks write intermediate data out, they always use the SequenceFile RecordWriter with key/value classes from the job object.
When the reducers write the final results out, their output format is obtained from the job object. By default it is TextOutputFormat, so there are no conflicts.
However, if one wants to use SequenceFileOutputFormat for the final results, then the key/value classes are also obtained from the job object, the same as for the map tasks' output. Now we have a problem: it is impossible for the map outputs and reducer outputs to use different key/value classes if one wants the reducers to generate output in SequenceFileOutputFormat.
A simple fix would be to add another two attributes to the JobConf class: mapOutputKeyClass and mapOutputValueClass. That allows the user to have different key/value classes for the intermediate and final outputs.
> permit reduce input types to differ from reduce output types
> ------------------------------------------------------------
>
> Key: HADOOP-115
> URL: http://issues.apache.org/jira/browse/HADOOP-115
> Project: Hadoop
> Type: New Feature
> Components: mapred
> Reporter: Runping Qi
> Assignee: Runping Qi
> Attachments: hadoop-115_ReduceTask.patch, hadoop-115_tk.patch, patch_115.txt.2006_05_16
[jira] Commented: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "Teppo Kurki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12372801 ]
Teppo Kurki commented on HADOOP-115:
------------------------------------
" I.e., the input keys and values are drawn from a different
domain than the output keys and values. Furthermore,
the intermediate keys and values are from the same domain
as the output keys and values. "
The Google MapReduce paper gives the Reverse Web-Link Graph example, where this is not the case:
"Reverse Web-Link Graph: The map function outputs
(target; source) pairs for each link to a target
URL found in a page named source. The reduce
function concatenates the list of all source URLs associated
with a given target URL and emits the pair:
(target; list(source))".
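The quoted example can be sketched in plain Java to make the type mismatch concrete. This is only an illustration under stated assumptions: `ReverseLinkGraph` and its methods are hypothetical names, nothing here uses the Hadoop API, and an in-memory `TreeMap` stands in for the shuffle. The intermediate value type is `String` (one source URL), while the reduce output value type is `List<String>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: shows intermediate and output value types that differ.
final class ReverseLinkGraph {

    // map: (source page, outgoing links) -> seq of (target, source).
    // Intermediate value type: String.
    static List<Map.Entry<String, String>> map(String source, List<String> targets) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String target : targets) {
            out.add(Map.entry(target, source));
        }
        return out;
    }

    // reduce: (target, seq(source)) -> (target, list(source)).
    // Output value type: List<String>, which differs from the intermediate type.
    static Map.Entry<String, List<String>> reduce(String target, List<String> sources) {
        List<String> copy = new ArrayList<>(sources);
        return Map.entry(target, copy);
    }

    static Map<String, List<String>> run(Map<String, List<String>> pages) {
        // Stand-in for the shuffle: group intermediate values by key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> page : pages.entrySet()) {
            for (Map.Entry<String, String> kv : map(page.getKey(), page.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> g : grouped.entrySet()) {
            Map.Entry<String, List<String>> kv = reduce(g.getKey(), g.getValue());
            result.put(kv.getKey(), kv.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> pages = new TreeMap<>();
        pages.put("a.html", List.of("x.html", "y.html"));
        pages.put("b.html", List.of("x.html"));
        System.out.println(run(pages));
    }
}
```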
> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-115
> URL: http://issues.apache.org/jira/browse/HADOOP-115
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Reporter: Runping Qi
[jira] Assigned: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]
Owen O'Malley reassigned HADOOP-115:
------------------------------------
Assign To: Runping Qi
Runping is going to take a look at fixing the patch.
> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-115
> URL: http://issues.apache.org/jira/browse/HADOOP-115
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Reporter: Runping Qi
> Assignee: Runping Qi
> Attachments: hadoop-115_ReduceTask.patch, hadoop-115_tk.patch
[jira] Updated: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]
Runping Qi updated HADOOP-115:
------------------------------
Attachment: patch_115.txt.2006_05_16
My patch is attached.
Change highlights:
JobConf.java:
- Added set/getMapOutputKey/ValueClass methods
- Modified getOutputKeyComparatorClass
MapTask.java:
- call getMapOutputKey/ValueClass instead of getOutputKey/ValueClass
MapReduce.java:
- call getMapOutputKey/ValueClass instead of getOutputKey/ValueClass
MapOutputFile.java:
- call getMapOutputKey/ValueClass instead of getOutputKey/ValueClass
I've run a job with a combiner, with UTF8 as the map output value class, and CrawledDoc as the output value class.
The job completed successfully.
> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-115
> URL: http://issues.apache.org/jira/browse/HADOOP-115
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Reporter: Runping Qi
> Assignee: Runping Qi
> Attachments: hadoop-115_ReduceTask.patch, hadoop-115_tk.patch, patch_115.txt.2006_05_16
[jira] Reopened: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]
Owen O'Malley reopened HADOOP-115:
----------------------------------
Let's reopen this. I've had discussions with Runping today, and it seems to me that:
1. It is basically free with respect to the framework.
2. It allows more applications to be written using the framework rather than working around the framework.
3. It is less clear that we should allow the user to change the key type in the reduce, but since the current API does allow them to change the value (if not the type), I think we should be consistent and allow a type change too.
I propose:
1. Add {set,get}MapOutput{Key,Value}Class functions in JobConf.
2. The default values for getMapOutput{Key,Value}Class are the values from getOutput{Key,Value}Class.
3. Always check the types in the output collector rather than in the OutputFormat, so that even text output files are checked for type correctness.
We should include a javadoc comment for setMapOutputKeyClass that warns that changing the key in the reduce will mean that your output is NOT sorted.
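Points 1 and 2 of the proposal could look roughly like the sketch below. `MiniJobConf` is a hypothetical stand-in and not the real Hadoop JobConf; the internal map keys and the `Object.class` default for unset output classes are assumptions made only so the example runs on its own. What it shows is the fallback: the map output getters return the final output classes unless a map output class was set explicitly, which keeps existing jobs working unchanged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for JobConf; not the real Hadoop class.
final class MiniJobConf {
    private final Map<String, Class<?>> classes = new HashMap<>();

    void setOutputKeyClass(Class<?> c) { classes.put("output.key", c); }
    void setOutputValueClass(Class<?> c) { classes.put("output.value", c); }
    void setMapOutputKeyClass(Class<?> c) { classes.put("map.output.key", c); }
    void setMapOutputValueClass(Class<?> c) { classes.put("map.output.value", c); }

    // Assumption: Object.class when nothing was configured.
    Class<?> getOutputKeyClass() { return classes.getOrDefault("output.key", Object.class); }
    Class<?> getOutputValueClass() { return classes.getOrDefault("output.value", Object.class); }

    // Point 2 of the proposal: default to the final output key class.
    Class<?> getMapOutputKeyClass() {
        Class<?> c = classes.get("map.output.key");
        return c != null ? c : getOutputKeyClass();
    }

    // Point 2 of the proposal: default to the final output value class.
    Class<?> getMapOutputValueClass() {
        Class<?> c = classes.get("map.output.value");
        return c != null ? c : getOutputValueClass();
    }

    public static void main(String[] args) {
        MiniJobConf conf = new MiniJobConf();
        conf.setOutputKeyClass(String.class);
        conf.setOutputValueClass(java.util.List.class);
        // Only the intermediate value type differs from the final output.
        conf.setMapOutputValueClass(String.class);
        System.out.println(conf.getMapOutputKeyClass());
        System.out.println(conf.getMapOutputValueClass());
    }
}
```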
> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-115
> URL: http://issues.apache.org/jira/browse/HADOOP-115
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Reporter: Runping Qi
[jira] Commented: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12372967 ]
Owen O'Malley commented on HADOOP-115:
--------------------------------------
I should have addressed the combiner before. *smile* Of course, the combiner's input and output types have to match the map output types. So, it looks like:
map: k1,v1 -> seq(k2,v2)
combine: k2,seq(v2) -> seq(k2,v2)
reduce: k2, seq(v2) -> seq(k3,v3)
So the only extra code is to set/get the types for k2/v2 (or, equivalently, k3/v3), although I would recommend adding a type check in the reduce collector. It is completely upward compatible.
As for user confusion, I've already had to explain this restriction (k2==k3 and v2==v3) far more times than I'd like.
On a side note, we could hack around the problem by defining an OutputFormat that uses SequenceFileWriter, but doesn't open the file until the first key/value pair is written and takes the types from the first instances. But that breaks when someone puts the type check into the reduce collector.
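The three signatures in this comment, together with the recommended type check in the collector, can be written down with Java generics as a rough sketch. All names here (`TypedStages`, `Mapper`, `Combiner`, `Reducer`, `Collector`, `CheckedCollector`) are hypothetical and only mirror the k1/v1, k2/v2, k3/v3 notation; they are not the Hadoop interfaces. The check uses `Class.isInstance`, which is an assumption about how strict such a check would be.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: type parameters mirror the k1/v1, k2/v2, k3/v3 notation.
final class TypedStages {

    interface Collector<K, V> {
        void collect(K key, V value);
    }

    // map: (k1, v1) -> seq(k2, v2)
    interface Mapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, Collector<K2, V2> out);
    }

    // combine: (k2, seq(v2)) -> seq(k2, v2); input and output types must match.
    interface Combiner<K2, V2> {
        void combine(K2 key, List<V2> values, Collector<K2, V2> out);
    }

    // reduce: (k2, seq(v2)) -> seq(k3, v3); output types may differ.
    interface Reducer<K2, V2, K3, V3> {
        void reduce(K2 key, List<V2> values, Collector<K3, V3> out);
    }

    // Collector that checks each record against the configured classes
    // at runtime, independently of the output format in use.
    static final class CheckedCollector<K, V> implements Collector<K, V> {
        private final Class<?> keyClass;
        private final Class<?> valueClass;
        final List<Object[]> records = new ArrayList<>();

        CheckedCollector(Class<?> keyClass, Class<?> valueClass) {
            this.keyClass = keyClass;
            this.valueClass = valueClass;
        }

        @Override
        public void collect(K key, V value) {
            if (!keyClass.isInstance(key)) {
                throw new IllegalArgumentException("wrong key class, expected " + keyClass.getName());
            }
            if (!valueClass.isInstance(value)) {
                throw new IllegalArgumentException("wrong value class, expected " + valueClass.getName());
            }
            records.add(new Object[] { key, value });
        }
    }

    public static void main(String[] args) {
        // Reducer whose output value type (Long) differs from its input (Integer).
        Reducer<String, Integer, String, Long> sum = (key, values, out) -> {
            long total = 0;
            for (int v : values) {
                total += v;
            }
            out.collect(key, total);
        };
        CheckedCollector<String, Long> out = new CheckedCollector<>(String.class, Long.class);
        sum.reduce("x", List.of(1, 2, 3), out);
        System.out.println(out.records.get(0)[0] + " -> " + out.records.get(0)[1]);
    }
}
```

A collector like this would also reject records from the lazy-opening OutputFormat hack whenever the first record's classes disagree with the configured ones, which matches the caveat in the side note.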
> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-115
> URL: http://issues.apache.org/jira/browse/HADOOP-115
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Reporter: Runping Qi
[jira] Updated: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "Teppo Kurki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=all ]
Teppo Kurki updated HADOOP-115:
-------------------------------
Attachment: hadoop-115_ReduceTask.patch
Patch including
TestReduceTask
- generates a bunch of SequenceFiles and reduces them by running a single ReduceTask
- two test methods, one where input is just copied to output and one where the Reducer swaps keys and values
- Reducer checks that all generated key-value pairs are reduced by key
- checks that the resulting output file contains what it's supposed to
JobConf
- the necessary set/getMapOutputKey/ValueClass methods
- getOutputComparator uses MapKeyClass if one is specified
ReduceTask
- append and sort phases get the classes from getMapOutput.. methods
This should take care of the Reduce part of the problem. MapTask should also be adjusted accordingly, but since I haven't written a test for that, I haven't done it yet.
Owen, I didn't get your comment on handling the combiners - doesn't the combiner just use the map OutputCollector underneath and as you put it
map: k1,v1 -> seq(k2,v2)
combine: k2,seq(v2) -> seq(k2,v2)
the outputs are exactly the same, even if the combiner is technically a Reducer?
> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-115
> URL: http://issues.apache.org/jira/browse/HADOOP-115
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Reporter: Runping Qi
> Attachments: hadoop-115_ReduceTask.patch, hadoop-115_tk.patch
[jira] Commented: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "Teppo Kurki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12372790 ]
Teppo Kurki commented on HADOOP-115:
------------------------------------
+1
My original post about the issue gives a simple case that would benefit from this: http://www.mail-archive.com/hadoop-user%40lucene.apache.org/msg00073.html
As already said, this would add more transformational power to Hadoop and make certain cases more straightforward.
> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-115
> URL: http://issues.apache.org/jira/browse/HADOOP-115
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Reporter: Runping Qi
[jira] Commented: (HADOOP-115) Hadoop should allow the user to use
SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Posted by "Darek Zbik (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-115?page=comments#action_12372700 ]
Darek Zbik commented on HADOOP-115:
-----------------------------------
The new attributes in JobConf should denote the output from the reduce task, not from the map as the name mapOutputValueClass suggests.
> Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-115
> URL: http://issues.apache.org/jira/browse/HADOOP-115
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Reporter: Runping Qi