Posted to mapreduce-user@hadoop.apache.org by Emmanuel Jeanvoine <em...@inria.fr> on 2010/01/12 15:19:24 UTC
How to use an alternative connector to SSH?
Hello,
I would like to use the Hadoop framework with MapReduce, and I have a
question concerning the use of SSH.
Is it possible to use a connector other than SSH to launch remote
commands?
I quickly checked the code, but I think this is hardcoded, since only
SSH options seem to be customizable.
Regards,
Emmanuel Jeanvoine.
Re: How do I sum by Key in the Reduce Phase AND keep the initial value
Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi Stephen,
I'm pretty sure the re-iterable reducer works by storing <k,v> in memory and spilling to disk once a certain threshold is reached. I don't know how they decide the limit, though (probably a parameter like io.sort.mb?), but the patch will throw some light on this.
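The buffer-and-spill idea can be sketched in plain Java like this (a sketch only: the threshold and temp-file handling here are illustrative, not what the actual patch does):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Keep values in memory up to a threshold; spill the rest to a temp
// file; replay both parts for each extra pass over the key's values.
public class SpillingBuffer {
    private final int threshold;
    private final List<Integer> memory = new ArrayList<>();
    private Path spill;              // overflow file, created lazily
    private BufferedWriter writer;

    public SpillingBuffer(int threshold) { this.threshold = threshold; }

    public void add(int v) throws IOException {
        if (memory.size() < threshold) {
            memory.add(v);
            return;
        }
        if (writer == null) {        // first overflow: open the spill file
            spill = Files.createTempFile("reduce-spill", ".txt");
            writer = Files.newBufferedWriter(spill);
        }
        writer.write(Integer.toString(v));
        writer.newLine();
    }

    // Re-read everything: the in-memory part first, then the spilled part.
    // Can be called as many times as needed, unlike the reducer's Iterator.
    public List<Integer> replay() throws IOException {
        if (writer != null) { writer.close(); writer = null; }
        List<Integer> all = new ArrayList<>(memory);
        if (spill != null) {
            for (String line : Files.readAllLines(spill)) {
                all.add(Integer.parseInt(line));
            }
        }
        return all;
    }
}
```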
The pattern you described is what I meant by a successor map-only job, but using HOD and a shared cluster had some other associated issues, so I preferred writing to a file instead.
Thanks,
Amogh
On 1/13/10 2:24 AM, "Stephen Watt" <sw...@us.ibm.com> wrote:
Thanks for responding Amogh.
I'm using Hadoop 0.20.1, and I see from the JIRA you mentioned that it's resolved in 0.21. Bummer... I've thought about the same thing you mentioned; however, it's my understanding that keeping those values or records in memory is dangerous, as you can run out of memory depending on how many values you have (and I have a big dataset). Really, what I am trying to understand here is the MapReduce pattern for solving this type of problem. Until we have a reduce-values iterator that we can move through more than once, I believe the pattern would be:
1) Have the first job simply store the key and the sum value
2) Using the same keys, have the second job append the value from the first job to each record in the reducer. This would be achieved by FIRST going to HDFS and looking up the key's value from the first job, and then iterating through the values for each key in the second job, appending the sum value to each record.
Kind regards
Steve Watt
From: Amogh Vasekar <am...@yahoo-inc.com>
To: "mapreduce-user@hadoop.apache.org" <ma...@hadoop.apache.org>
Date: 01/12/2010 02:01 PM
Subject: Re: How do I sum by Key in the Reduce Phase AND keep the initial value
________________________________
Hi,
I ran into a very similar situation quite some time back and encountered this: http://issues.apache.org/jira/browse/HADOOP-475
When I spoke to a few Hadoop folks, they said complete cloning was not a straightforward option, for optimization reasons.
There were a few things I tried: running this in a single MR job, emitting <k,v> from the mapper one more time with some tagging info (this bumped up the sort-and-shuffle phase by quite a lot); running a map-only successor job; etc. But keeping records in memory and writing to disk after a certain threshold worked pretty well for me (all this on Hadoop 0.17.2).
Anyway, they seem to have resolved it in the next Hadoop release.
Amogh
On 1/12/10 10:29 PM, "Stephen Watt" <sw...@us.ibm.com> wrote:
The key-value pairs coming into my Reducer are as follows:
KEY(Text) VALUE(IntWritable)
A 11
A 9
B 2
B 3
I want my reducer to sum the values for each input key, and then output the key with a Text value containing the original value and the sum:
KEY(Text) VALUE(Text)
A 11 20
A 9 20
B 2 5
B 3 5
Here is the issue: in the reducer, I iterate through the values for each key using values.iterator() and store the total in a variable. Then I TRY to iterate through the values again, this time writing the new value, e.g. (A, new Text("11 20")), to the output collector to create the value structure displayed in the example above. This fails because it appears I can only iterate through the values for each key ONCE. I know this because additional attempts to get new iterators, from the context or from the Iterable that's passed into the reducer, always return false on the initial hasNext().
I have to iterate through them twice: the first time to sum the values, and the second time to write the initial value (11) and the sum (20), since I need both values as part of a calculation in the next job. Any ideas on how to do this?
Kind regards
Steve Watt
Re: How do I sum by Key in the Reduce Phase AND keep the initial value
Posted by Stephen Watt <sw...@us.ibm.com>.
Thanks for responding Amogh.
I'm using Hadoop 0.20.1, and I see from the JIRA you mentioned that it's
resolved in 0.21. Bummer... I've thought about the same thing you
mentioned; however, it's my understanding that keeping those values or
records in memory is dangerous, as you can run out of memory depending on
how many values you have (and I have a big dataset). Really, what I am
trying to understand here is the MapReduce pattern for solving this type
of problem. Until we have a reduce-values iterator that we can move
through more than once, I believe the pattern would be:
1) Have the first job simply store the key and the sum value
2) Using the same keys, have the second job append the value from the
first job to each record in the reducer. This would be achieved by FIRST
going to HDFS and looking up the key's value from the first job, and then
iterating through the values for each key in the second job, appending
the sum value to each record.
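In plain Java, that two-job pattern would look roughly like this (a simulation only: the real version would be two MR jobs, with job 2's reducer reading job 1's output from HDFS, and all the names here are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoJobPattern {
    // Job 1: reduce each key group down to its sum.
    public static Map<String, Integer> sumJob(Map<String, List<Integer>> groups) {
        Map<String, Integer> sums = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            sums.put(e.getKey(), sum);
        }
        return sums;
    }

    // Job 2: look up the key's sum (from job 1's output) and append it
    // to each record as "value sum".
    public static List<String> appendJob(String key, List<Integer> values,
                                         Map<String, Integer> sums) {
        int sum = sums.get(key);   // in the real job, read from job 1's HDFS output
        List<String> out = new ArrayList<>();
        for (int v : values) out.add(v + " " + sum);
        return out;
    }
}
```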
Kind regards
Steve Watt
From: Amogh Vasekar <am...@yahoo-inc.com>
To: "mapreduce-user@hadoop.apache.org" <ma...@hadoop.apache.org>
Date: 01/12/2010 02:01 PM
Subject: Re: How do I sum by Key in the Reduce Phase AND keep the initial value
Hi,
I ran into a very similar situation quite some time back and encountered
this: http://issues.apache.org/jira/browse/HADOOP-475
When I spoke to a few Hadoop folks, they said complete cloning was not a
straightforward option, for optimization reasons.
There were a few things I tried: running this in a single MR job,
emitting <k,v> from the mapper one more time with some tagging info (this
bumped up the sort-and-shuffle phase by quite a lot); running a map-only
successor job; etc. But keeping records in memory and writing to disk
after a certain threshold worked pretty well for me (all this on Hadoop
0.17.2).
Anyway, they seem to have resolved it in the next Hadoop release.
Amogh
On 1/12/10 10:29 PM, "Stephen Watt" <sw...@us.ibm.com> wrote:
The key-value pairs coming into my Reducer are as follows:
KEY(Text) VALUE(IntWritable)
A 11
A 9
B 2
B 3
I want my reducer to sum the values for each input key, and then output
the key with a Text value containing the original value and the sum:
KEY(Text) VALUE(Text)
A 11 20
A 9 20
B 2 5
B 3 5
Here is the issue: in the reducer, I iterate through the values for each
key using values.iterator() and store the total in a variable. Then I TRY
to iterate through the values again, this time writing the new value,
e.g. (A, new Text("11 20")), to the output collector to create the value
structure displayed in the example above. This fails because it appears I
can only iterate through the values for each key ONCE. I know this
because additional attempts to get new iterators, from the context or
from the Iterable that's passed into the reducer, always return false on
the initial hasNext().
I have to iterate through them twice: the first time to sum the values,
and the second time to write the initial value (11) and the sum (20),
since I need both values as part of a calculation in the next job. Any
ideas on how to do this?
Kind regards
Steve Watt
Re: How do I sum by Key in the Reduce Phase AND keep the initial value
Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
I ran into a very similar situation quite some time back and encountered this: http://issues.apache.org/jira/browse/HADOOP-475
When I spoke to a few Hadoop folks, they said complete cloning was not a straightforward option, for optimization reasons.
There were a few things I tried: running this in a single MR job, emitting <k,v> from the mapper one more time with some tagging info (this bumped up the sort-and-shuffle phase by quite a lot); running a map-only successor job; etc. But keeping records in memory and writing to disk after a certain threshold worked pretty well for me (all this on Hadoop 0.17.2).
Anyway, they seem to have resolved it in the next Hadoop release.
Amogh
On 1/12/10 10:29 PM, "Stephen Watt" <sw...@us.ibm.com> wrote:
The key-value pairs coming into my Reducer are as follows:
KEY(Text) VALUE(IntWritable)
A 11
A 9
B 2
B 3
I want my reducer to sum the values for each input key, and then output the key with a Text value containing the original value and the sum:
KEY(Text) VALUE(Text)
A 11 20
A 9 20
B 2 5
B 3 5
Here is the issue: in the reducer, I iterate through the values for each key using values.iterator() and store the total in a variable. Then I TRY to iterate through the values again, this time writing the new value, e.g. (A, new Text("11 20")), to the output collector to create the value structure displayed in the example above. This fails because it appears I can only iterate through the values for each key ONCE. I know this because additional attempts to get new iterators, from the context or from the Iterable that's passed into the reducer, always return false on the initial hasNext().
I have to iterate through them twice: the first time to sum the values, and the second time to write the initial value (11) and the sum (20), since I need both values as part of a calculation in the next job. Any ideas on how to do this?
Kind regards
Steve Watt
How do I sum by Key in the Reduce Phase AND keep the initial value
Posted by Stephen Watt <sw...@us.ibm.com>.
The key-value pairs coming into my Reducer are as follows:
KEY(Text) VALUE(IntWritable)
A 11
A 9
B 2
B 3
I want my reducer to sum the values for each input key, and then output
the key with a Text value containing the original value and the sum:
KEY(Text) VALUE(Text)
A 11 20
A 9 20
B 2 5
B 3 5
Here is the issue: in the reducer, I iterate through the values for each
key using values.iterator() and store the total in a variable. Then I TRY
to iterate through the values again, this time writing the new value,
e.g. (A, new Text("11 20")), to the output collector to create the value
structure displayed in the example above. This fails because it appears I
can only iterate through the values for each key ONCE. I know this
because additional attempts to get new iterators, from the context or
from the Iterable that's passed into the reducer, always return false on
the initial hasNext().
I have to iterate through them twice: the first time to sum the values,
and the second time to write the initial value (11) and the sum (20),
since I need both values as part of a calculation in the next job. Any
ideas on how to do this?
Kind regards
Steve Watt
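For reference, the single-pass buffering workaround discussed in this thread looks roughly like this. It is a plain-Java sketch with the Hadoop classes stripped out: in a real reducer the method would take Text/IntWritable and each output line would be a collect/write call, and the whole key group sits in memory, which is the risk noted above.

```java
import java.util.ArrayList;
import java.util.List;

public class SumAndKeep {
    // Buffer the values for one key while summing them, then emit
    // "value sum" for each buffered value. Mirrors the desired output:
    // (A, 11), (A, 9) -> "11 20", "9 20".
    public static List<String> reduce(Iterable<Integer> values) {
        List<Integer> buffered = new ArrayList<>();  // caution: whole key group in memory
        int sum = 0;
        for (int v : values) {                       // the one pass the iterator allows
            buffered.add(v);
            sum += v;
        }
        List<String> out = new ArrayList<>();
        for (int v : buffered) {                     // second pass over our own buffer
            out.add(v + " " + sum);
        }
        return out;
    }
}
```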
Re: How to use an alternative connector to SSH?
Posted by Eric Sammer <er...@lifeless.net>.
On 1/12/10 9:19 AM, Emmanuel Jeanvoine wrote:
> Hello,
>
> I would like to use the Hadoop framework with MapReduce, and I have a
> question concerning the use of SSH.
> Is it possible to use a connector other than SSH to launch remote
> commands?
>
> I quickly checked the code, but I think this is hardcoded, since only
> SSH options seem to be customizable.
Emmanuel:
You're correct in that ssh is hard-coded in the start-*.sh scripts. You
can either use Cloudera's distribution (which comes with init scripts
that run on each node, like other services), or you can roll your own
startup scripts and invoke the underlying hadoop-daemon.sh script on
each node over whatever communication channel you'd like. You may have
to do a little environment setup first if you choose to go this route.
Take a look at the source of the start-*.sh scripts; they're pretty simple.
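A roll-your-own launcher might look roughly like this (a sketch only: RCMD, the slaves file path, and the daemon name are placeholders for your setup, and the commands are printed rather than executed so you can inspect them first):

```shell
#!/bin/sh
# Start a Hadoop daemon on every slave over a transport of your choice
# instead of ssh. RCMD, SLAVES and HADOOP_HOME are illustrative
# defaults, not Hadoop settings.
RCMD="${RCMD:-rsh}"                    # e.g. rsh, pdsh, oarsh, taktuk
SLAVES="${SLAVES:-conf/slaves}"        # one hostname per line
HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"
while read -r host; do
  # hadoop-daemon.sh does the real work; ssh was only ever the transport.
  echo "$RCMD $host $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker"
done < "$SLAVES"
```

Swap the echo for a direct invocation (or pipe the output to sh) once you're happy with the generated commands.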
Hope this helps.
--
Eric Sammer
eric@lifeless.net
http://esammer.blogspot.com