Posted to common-user@hadoop.apache.org by abc xyz <fa...@yahoo.com> on 2010/08/09 17:05:39 UTC

Total order partitioner

The input splits are sampled when we use the total order partitioner. I want to
know how and when this sampling is done. Is it done before the master allocates
tasks to the nodes, since the sampling file has to be added to the distributed
cache as well? If so, is the sampling carried out on the master node? Would the
master then have to access the input splits to get the samples?



      

Re: Total order partitioner [Modified]

Posted by Gang Luo <lg...@yahoo.com.cn>.
The sampling is done at the master node by accessing the splits before the job
is submitted. By default, the partitioner should send each key to exactly one
partition, unless you modify it.
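
To make the "one key, one partition" point concrete, here is a minimal sketch in
plain Java. It is not Hadoop's TotalOrderPartitioner code; the class
TotalOrderSketch and its methods chooseSplitPoints and getPartition are
hypothetical names used only for illustration. It assumes String keys in natural
order and strictly increasing split points, so a repeated boundary such as
"D, D" is rejected by this sketch.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of the Hadoop API.
class TotalOrderSketch {
    // n strictly increasing split points define n + 1 key ranges.
    private final String[] splitPoints;

    TotalOrderSketch(String[] splitPoints) {
        // Assumption of this sketch: boundaries must be strictly increasing,
        // so no two ranges can overlap.
        for (int i = 0; i + 1 < splitPoints.length; i++) {
            if (splitPoints[i].compareTo(splitPoints[i + 1]) >= 0) {
                throw new IllegalArgumentException(
                        "split points must be strictly increasing");
            }
        }
        this.splitPoints = splitPoints.clone();
    }

    // Sorts the sampled keys and picks numPartitions - 1 evenly spaced
    // boundaries. This step would run once, before any task starts.
    static String[] chooseSplitPoints(String[] samples, int numPartitions) {
        String[] sorted = samples.clone();
        Arrays.sort(sorted);
        String[] points = new String[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return points;
    }

    // Binary search over the boundaries. A key equal to a boundary goes to
    // the partition on the right of it, so each key maps to exactly one
    // partition.
    int getPartition(String key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        String[] samples = {"H", "A", "Y", "C", "D", "M"};
        TotalOrderSketch partitioner =
                new TotalOrderSketch(chooseSplitPoints(samples, 3));
        for (String key : new String[] {"A", "D", "E", "M", "Z"}) {
            System.out.println(key + " -> " + partitioner.getPartition(key));
        }
    }
}
```

Because the lookup is deterministic, a key equal to a boundary is never sent
"randomly" to either neighbouring partition in this sketch. Getting that
behaviour would need a custom partitioner.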

-Gang




----- Original Message ----
From: abc xyz <fa...@yahoo.com>
To: common-user@hadoop.apache.org
Sent: 2010/8/9 (Mon) 11:30:11 AM
Subject: Total order partitioner [Modified]


1) The input splits are sampled when we use the total order partitioner provided
in Hadoop 0.19. I want to know how and when this sampling is done. Is it done
before the master allocates tasks to the nodes, since the sampling file has to
be added to the distributed cache as well? If so, is the sampling carried out on
the master node? Would the master then have to access the input splits to get
the samples?

2) Also, does the total order partitioner allow ranges where a key can belong to
more than one range? I mean something like this: A, C, D, D, H, Y, where keys
from A to C are sent to one partition, keys from C to D are sent to the 2nd
partition, keys with value D can be sent randomly to either the 2nd or 3rd
partition, and so on. Or are these ranges mutually exclusive?


      

Total order partitioner [Modified]

Posted by abc xyz <fa...@yahoo.com>.
1) The input splits are sampled when we use the total order partitioner provided
in Hadoop 0.19. I want to know how and when this sampling is done. Is it done
before the master allocates tasks to the nodes, since the sampling file has to
be added to the distributed cache as well? If so, is the sampling carried out on
the master node? Would the master then have to access the input splits to get
the samples?

2) Also, does the total order partitioner allow ranges where a key can belong to
more than one range? I mean something like this: A, C, D, D, H, Y, where keys
from A to C are sent to one partition, keys from C to D are sent to the 2nd
partition, keys with value D can be sent randomly to either the 2nd or 3rd
partition, and so on. Or are these ranges mutually exclusive?