Posted to common-user@hadoop.apache.org by Daniel Yehdego <dt...@miners.utep.edu> on 2011/09/08 19:07:46 UTC

HADOOP MapReduce sorting

Hi, 
I want to use an input file of RNA sequences, one per line, in which each line is passed to the mapper (an executable program that determines the secondary structure of that sequence). I am also using a reducer which concatenates the output lines from the mapper. My problem is that the final output is not sorted in the same order as the input sequences (RNA-1, RNA-2, RNA-3, ...).

STDIN INPUT FILE:
RNA-1
RNA-2
RNA-3
.....

MAPPER OUTPUT:
MAP1 <RNA-2><STRUCTURE-2>
MAP2 <RNA-1><STRUCTURE-1>
MAP3 <RNA-3><STRUCTURE-3>

REDUCER OUTPUT:
<RNA-2><RNA-1><RNA-3>\t<STRUCTURE-1><STRUCTURE-2><STRUCTURE-3>\n
OR
<RNA-3><RNA-2><RNA-1>\t<STRUCTURE-1><STRUCTURE-2><STRUCTURE-3>\n

What I am looking for is to reduce in the following ordered manner:
<RNA-1><RNA-2><RNA-3>\t<STRUCTURE-1><STRUCTURE-2><STRUCTURE-3>\n

Looking forward to your input.

Regards, 

Daniel T. Yehdego
Computational Science Program 
University of Texas at El Paso, UTEP 
dtyehdego@miners.utep.edu

Re: Ganglia 3.2 and Hadoop .20.2

Posted by robert <ro...@austin.rr.com>.

Sorry to follow up my own post, but I thought I would give it one more
shot this morning and change the setting to dfs.servers=239.2.11.71:8649
(the multicast address).

Though I am sure I tried that before, it works this time. 
Perhaps the Ganglia system was in some unusual state before.


On 09/11/11 08:27, robert wrote:
> I downloaded the latest version of Ganglia and compiled and installed
> on my Hadoop cluster. Configured according to the documented
> procedures. The latest stable version of Ganglia is 3.2, and I am
> using hadoop-0.20.2-cdh31
>
> I just copied the gmond.conf from the distribution to the nodes. It
> has what look like default values 239.2.11.71 for mcast_join and port
> 8649 throughout.
>
> The core (non hadoop) Ganglia reporting works fine, but Ganglia is not
> communicating with Hadoop in any reproducible way.  I got reporting on
> one node once, got a *different* node reported from telnet localhost
> 8649 once, but more generally get no reporting of hadoop metrics at
> all!  When I bounce the cluster and/or gmond I may or may not get any
> difference in behavior. It is frustrating because the behavior seems
> to be random and not reproducible.
>
> I wonder if there is a problem with version compatibility?  If there
> were release notes indicating a compatibility issue I didn't see them
> on the ganglia site.  At this point, I'm tempted to give up on Ganglia
> for hadoop metrics and look for alternatives.
>
> Any ideas?

Ganglia 3.2 and Hadoop .20.2

Posted by robert <ro...@austin.rr.com>.

I downloaded the latest version of Ganglia and compiled and installed
on my Hadoop cluster. Configured according to the documented
procedures. The latest stable version of Ganglia is 3.2, and I am
using hadoop-0.20.2-cdh31

I just copied the gmond.conf from the distribution to the nodes. It
has what look like default values 239.2.11.71 for mcast_join and port
8649 throughout.

The core (non hadoop) Ganglia reporting works fine, but Ganglia is not
communicating with Hadoop in any reproducible way.  I got reporting on
one node once, got a *different* node reported from telnet localhost
8649 once, but more generally get no reporting of hadoop metrics at
all!  When I bounce the cluster and/or gmond I may or may not get any
difference in behavior. It is frustrating because the behavior seems
to be random and not reproducible.

I wonder if there is a problem with version compatibility?  If there
were release notes indicating a compatibility issue I didn't see them
on the ganglia site.  At this point, I'm tempted to give up on Ganglia
for hadoop metrics and look for alternatives.

Any ideas?

Re: HADOOP MapReduce sorting

Posted by Mehmet Tepedelenlioglu <me...@gmail.com>.

If you have a set of key-value pairs that you want to have in the same reducer, label them with an index key like so:

<1,RNA1-STRUCT1>
<1,RNA2-STRUCT2>
<1,RNA3-STRUCT3>

In this case RNA1, RNA2 and RNA3, with their corresponding structures, will end up in the same reducer. So your mappers won't use RNAi as the key, but another grouping key.
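
The grouping-key idea above can be sketched in Python for Hadoop Streaming. This is only an illustration under assumptions that are not in the original thread: each input line is assumed to carry its original position (for example, a line number added before the job runs), and the names GROUP_KEY, map_record and reduce_records are hypothetical. The mapper emits a constant grouping key plus the position, and the reducer sorts each group by that position before concatenating.

```python
from itertools import groupby

# Hypothetical constant grouping key: every record that shares it
# is delivered to the same reducer.
GROUP_KEY = "1"


def map_record(index, rna, structure):
    """Build one tab-separated mapper output record.

    `index` is the position of the sequence in the input file. Carrying it
    along is an assumption of this sketch; the input would need to be
    numbered before the job runs.
    """
    return f"{GROUP_KEY}\t{index}\t{rna}\t{structure}"


def reduce_records(lines):
    """Concatenate the records of each group in original input order."""
    records = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    # Hadoop delivers reducer input sorted by key; sorting here as well
    # keeps the function usable on its own.
    records.sort(key=lambda r: r[0])
    output = []
    for _, group in groupby(records, key=lambda r: r[0]):
        # Restore the input order using the numeric position field.
        ordered = sorted(group, key=lambda r: int(r[1]))
        rnas = "".join(f"<{r[2]}>" for r in ordered)
        structures = "".join(f"<{r[3]}>" for r in ordered)
        output.append(f"{rnas}\t{structures}")
    return output


if __name__ == "__main__":
    # Mapper output arriving out of order, as in the original question.
    shuffled = [
        map_record(2, "RNA-2", "STRUCTURE-2"),
        map_record(1, "RNA-1", "STRUCTURE-1"),
        map_record(3, "RNA-3", "STRUCTURE-3"),
    ]
    for line in reduce_records(shuffled):
        print(line)
```

Sorting inside the reducer keeps the sketch simple, but it holds a whole group in memory. For large inputs, letting Hadoop sort on the position field during the shuffle (a secondary sort) may be the better choice.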

On Sep 8, 2011, at 10:07 AM, Daniel Yehdego wrote:

> 
> Hi, 
> I want to use an input file of RNA sequences, one per line, in which each line is passed to the mapper (an executable program that determines the secondary structure of that sequence). I am also using a reducer which concatenates the output lines from the mapper. My problem is that the final output is not sorted in the same order as the input sequences (RNA-1, RNA-2, RNA-3, ...).
>
> STDIN INPUT FILE:
> RNA-1
> RNA-2
> RNA-3
> .....
>
> MAPPER OUTPUT:
> MAP1 <RNA-2><STRUCTURE-2>
> MAP2 <RNA-1><STRUCTURE-1>
> MAP3 <RNA-3><STRUCTURE-3>
>
> REDUCER OUTPUT:
> <RNA-2><RNA-1><RNA-3>\t<STRUCTURE-1><STRUCTURE-2><STRUCTURE-3>\n
> OR
> <RNA-3><RNA-2><RNA-1>\t<STRUCTURE-1><STRUCTURE-2><STRUCTURE-3>\n
>
> What I am looking for is to reduce in the following ordered manner:
> <RNA-1><RNA-2><RNA-3>\t<STRUCTURE-1><STRUCTURE-2><STRUCTURE-3>\n
>
> Looking forward to your input.
> 
> Regards, 
> 
> Daniel T. Yehdego
> Computational Science Program 
> University of Texas at El Paso, UTEP 
> dtyehdego@miners.utep.edu