Posted to hdfs-user@hadoop.apache.org by Adeel Qureshi <ad...@gmail.com> on 2013/08/30 01:23:01 UTC

secondary sort - number of reducers

I have implemented secondary sort in my MR job, and for some reason if I
don't specify the number of reducers it uses 1, which doesn't seem right
because I'm working with 800M+ records and one reducer slows things down
significantly. Is this some kind of limitation of secondary sort, that
it has to use a single reducer? That would kind of defeat the purpose of
having a scalable solution such as secondary sort. I would appreciate any
help.

Thanks
Adeel

RE: secondary sort - number of reducers

Posted by java8964 java8964 <ja...@hotmail.com>.
The method getPartition() needs to return a non-negative number; simply using the hashCode() method is not enough.
See the Hadoop HashPartitioner implementation:

return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
When I first read this code, I always wonder why not use Math.abs? Is ( & Integer.MAX_VALUE) faster?
Yong
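Yong's question has a concrete answer: Math.abs() is not just a performance question, it is incorrect for one input. A minimal, self-contained Java sketch (class and method names are illustrative):

```java
// Why (key.hashCode() & Integer.MAX_VALUE) instead of Math.abs():
// Math.abs(Integer.MIN_VALUE) overflows and returns Integer.MIN_VALUE itself,
// so the "absolute value" can still be negative. Masking off the sign bit
// cannot overflow and is branch-free.
public class PartitionDemo {
    static int withAbs(int hash, int numParts) {
        return Math.abs(hash) % numParts; // broken for Integer.MIN_VALUE
    }

    static int withMask(int hash, int numParts) {
        return (hash & Integer.MAX_VALUE) % numParts; // always in [0, numParts)
    }

    public static void main(String[] args) {
        System.out.println(withAbs(Integer.MIN_VALUE, 3));  // negative: an illegal partition
        System.out.println(withMask(Integer.MIN_VALUE, 3)); // non-negative
    }
}
```

So the mask is both safer and cheaper than Math.abs() here.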
Date: Thu, 29 Aug 2013 20:55:46 -0400
Subject: Re: secondary sort - number of reducers
From: adeelmahmood@gmail.com
To: user@hadoop.apache.org

okay so when I specify the number of reducers, e.g. in my example I'm using 4 (for a much smaller data set), it works if I use a single column in my composite key .. but if I add multiple columns in the composite key separated by a delimiter .. it then throws the illegal partition error (keys before the pipe are group keys, keys after the pipe are the sort keys, and my partitioner only uses the group keys):

java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel (-1)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
        at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)


	public int getPartition(Text key, HCatRecord record, int numParts) {
		//extract the group key from composite key
		String groupKey = key.toString().split("\\|")[0];
		return groupKey.hashCode() % numParts;
	}
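The (-1) in the exception is the giveaway: a negative hashCode() flows straight through the modulus. A self-contained sketch of the corrected logic (Hadoop's Text/HCatRecord types replaced with String so it compiles standalone; class name is illustrative):

```java
// Standalone sketch of the partitioner fix: mask the sign bit before the
// modulus so the partition index is always in [0, numParts).
public class GroupKeyPartitioner {
    public static int getPartition(String compositeKey, int numParts) {
        // group key is the part of the composite key before the pipe
        String groupKey = compositeKey.split("\\|")[0];
        return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    public static void main(String[] args) {
        // the exact key from the stack trace above
        System.out.println(getPartition("Atlanta:GA|Atlanta:GA:1:Adeel", 4));
    }
}
```

Note that all keys sharing a group key land on the same reducer regardless of their sort-key suffix, which is the property secondary sort depends on.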

On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <sh...@gmail.com> wrote:

No... the partitioner decides which keys should go to which reducer, and
the number of reducers you need to decide yourself. The number of reducers
depends on factors like the number of key-value pairs, the use case, etc.

Regards,
Som Shekhar Sharma
+91-8197243810





On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com> wrote:

> so it cant figure out an appropriate number of reducers as it does for
> mappers .. in my case hadoop is using 2100+ mappers and then only 1 reducer
> .. since im overriding the partitioner class shouldnt that decide how
> many reducers there should be based on how many different partition values
> are being returned by the custom partitioner
>
>
> On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com> wrote:
>>
>> If you don't specify the number of Reducers, Hadoop will use the default
>> -- which, unless you've changed it, is 1.
>>
>> Regards
>>
>> Ian.
>>
>> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com> wrote:
>>
>> I have implemented secondary sort in my MR job and for some reason if i
>> dont specify the number of reducers it uses 1 which doesnt seems right
>> because im working with 800M+ records and one reducer slows things down
>> significantly. Is this some kind of limitation with the secondary sort that
>> it has to use a single reducer .. that kind of would defeat the purpose of
>> having a scalable solution such as secondary sort. I would appreciate any
>> help.
>>
>> Thanks
>> Adeel
>>
>> ---
>> Ian Wrigley
>> Sr. Curriculum Manager
>> Cloudera, Inc
>> Cell: (323) 819 4075
>>
>

Re: secondary sort - number of reducers

Posted by Ravi Kiran <ra...@gmail.com>.
Adeel,
   To add to Yong's points:
a)   Consider tuning the number of copier threads in the reduce tasks and the
task tracker process (mapred.reduce.parallel.copies).
b)   See if the map output can be compressed to ensure there is less I/O.
c)   Increase io.sort.factor so the framework merges a larger number of files
in each merge sort at the reducer.
d)   Check the "Reduce Shuffle Bytes" counter of each reducer to see any skew
of data toward a few reducers. Aim for an even distribution of load through
better partitioner code.

Regards
Ravi Magham
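For items (a)-(c), the knobs are ordinary job/site configuration. The values below are only illustrative starting points, not recommendations; property names are the ones used in Hadoop 1.x (the version discussed in this thread):

```xml
<!-- Illustrative values only; tune against your own cluster and job. -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>20</value> <!-- (a) more parallel fetch threads in the shuffle -->
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value> <!-- (b) compress map output to cut shuffle I/O -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>64</value> <!-- (c) merge more spill files per merge pass -->
</property>
```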


On Fri, Aug 30, 2013 at 9:28 PM, java8964 java8964 <ja...@hotmail.com>wrote:

> Well, the reducers normally take much longer than the mapper stage,
> because the copy/shuffle/sort all happens at that time, and that is the
> hard part.
>
> But before we simply say it is part of life, you need to dig into your MR
> jobs more to find out if you can make them faster.
>
> You are the person most familiar with your data, and you wrote the code to
> group/partition it and send it to the reducers. Even if you set up 255
> reducers, the question is: does each of them get its fair share?
> You need to read the COUNTER information of each reducer and find out
> how many reduce groups each reducer gets, how many input bytes it gets,
> etc.
>
> Simple example: if you send 200G of data and group it by DATE, and all the
> data belongs to 2 days, with one of those days holding 90% of the data, then
> giving the job 255 reducers won't help, as only 2 reducers will consume data,
> and one of them will consume 90% of it and take a very long time to finish,
> which WILL delay the whole MR job, while the rest of the reducers finish
> within seconds. In this case, maybe you need to rethink what your key
> should be, and make sure each reducer gets its fair share of data volume.
>
> After the above fix (in fact, it will normally fix 90% of reducer
> performance problems, especially since you have 255 reducer tasks available,
> so each one on average will only get 1G of data -- good for your huge
> cluster that only needs to process 256G of data :-), if you want to make it
> even faster, then check your code. Do you have to use String.compareTo()?
> Is it slow? Google "hadoop rawcomparator" to see if you can do something
> here.
>
> After that, if you still think the reducer stage is slow, check your
> cluster system. Does the reducer spend most of its time in the copy stage,
> in the sort, or in your reducer class? Find out where the time is spent,
> then identify the solution.
>
> Yong
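Yong's "fair share" check can also be done offline: run a sample of group keys through the same partition function the job uses and count how many land on each reducer. A self-contained, hypothetical sketch (the sample keys and partition count are made up):

```java
import java.util.Arrays;

// Offline skew check: simulate the job's partitioner over sample group keys
// and count keys per reducer. A lopsided histogram predicts a straggler.
public class SkewCheck {
    static int partition(String groupKey, int numParts) {
        return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    public static void main(String[] args) {
        String[] sampleKeys = {"Atlanta:GA", "Boston:MA", "Chicago:IL",
                               "Denver:CO", "Atlanta:GA"};
        int numParts = 4; // illustrative reducer count
        int[] counts = new int[numParts];
        for (String k : sampleKeys) {
            counts[partition(k, numParts)]++;
        }
        System.out.println(Arrays.toString(counts));
    }
}
```

In a real job the same numbers come from the per-reducer "Reduce input groups" / "Reduce shuffle bytes" counters, but a quick simulation like this catches gross skew before you burn cluster time.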
>
> ------------------------------
> Date: Fri, 30 Aug 2013 11:02:05 -0400
>
> Subject: Re: secondary sort - number of reducers
> From: adeelmahmood@gmail.com
> To: user@hadoop.apache.org
>
>
>
> my secondary sort on multiple keys seems to work fine with smaller data
> sets, but with bigger data sets (like 256 gig and 800M+ records) the mapper
> phase gets done pretty quickly (about 15 mins) but then the reducer phase
> seems to take forever. I am using 255 reducers.
>
> basic idea is that my composite key has both group and sort keys in it,
> which I parse in the appropriate comparator classes to perform grouping and
> sorting .. my thinking is that the mappers are where most of the work is done
> 1. mapper itself (create composite key and value)
> 2. records sorting
> 3. partitioner
>
> if all this gets done in 15 mins then the reducer has the simple task of
> 1. grouping comparator
> 2. reducer itself (simply output records)
>
> and should take less time than the mappers .. instead it essentially gets
> stuck in the reduce phase .. i'm gonna paste my code here to see if anything
> stands out as a fundamental design issue
>
> //////PARTITIONER
> public int getPartition(Text key, HCatRecord record, int numReduceTasks) {
> //extract the group key from composite key
>  String groupKey = key.toString().split("\\|")[0];
> return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>  }
>
>
> ////////////GROUP COMPARATOR
> public int compare(WritableComparable a, WritableComparable b) {
> //compare the group keys of the two Text objects
>  String thisGroupKey = ((Text) a).toString().split("\\|")[0];
> String otherGroupKey = ((Text) b).toString().split("\\|")[0];
> return thisGroupKey.compareTo(otherGroupKey);
> }
>
>
> ////////////SORT COMPARATOR
> is similar to group comparator and is in map phase and gets done quick
>
>
>
> //////////REDUCER
> public void reduce(Text key, Iterable<HCatRecord> records, Context
> context) throws IOException, InterruptedException {
> log.info("in reducer for key " + key.toString());
>  Iterator<HCatRecord> recordsIter = records.iterator();
> //we are only interested in the first record after sorting and grouping
>  if(recordsIter.hasNext()){
> HCatRecord rec = recordsIter.next();
> context.write(nw, rec);
>  log.info("returned record >> " + rec.toString());
> }
> }
>
>
> On Fri, Aug 30, 2013 at 9:24 AM, Adeel Qureshi <ad...@gmail.com>wrote:
>
> yup it was negative and by doing this now it seems to be working fine
>
>
> On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma <sh...@gmail.com>wrote:
>
> Is the hash code of that key  is negative.?
> Do something like this
>
> return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;
>
> Regards,
> Som Shekhar Sharma
> +91-8197243810
>
>
> On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <ad...@gmail.com>
> wrote:
> > okay so when i specify the number of reducers e.g. in my example i m
> using 4
> > (for a much smaller data set) it works if I use a single column in my
> > composite key .. but if I add multiple columns in the composite key
> > separated by a delimi .. it then throws the illegal partition error (keys
> > before the pipe are group keys and after the pipe are the sort keys and
> my
> > partioner only uses the group keys
> >
> > java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel
> > (-1)
> >         at
> >
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
> >         at
> >
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
> >         at
> >
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
> >         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
> >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >         at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:396)
> >         at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> >         at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >
> >
> > public int getPartition(Text key, HCatRecord record, int numParts) {
> > //extract the group key from composite key
> > String groupKey = key.toString().split("\\|")[0];
> > return groupKey.hashCode() % numParts;
> > }
> >
> >
> > On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <sh...@gmail.com>
> > wrote:
> >>
> >> No...partitionr decides which keys should go to which reducer...and
> >> number of reducers you need to decide...No of reducers depends on
> >> factors like number of key value pair, use case etc
> >> Regards,
> >> Som Shekhar Sharma
> >> +91-8197243810
> >>
> >>
> >> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com>
> >> wrote:
> >> > so it cant figure out an appropriate number of reducers as it does for
> >> > mappers .. in my case hadoop is using 2100+ mappers and then only 1
> >> > reducer
> >> > .. since im overriding the partitioner class shouldnt that decide how
> >> > manyredeucers there should be based on how many different partition
> >> > values
> >> > being returned by the custom partiotioner
> >> >
> >> >
> >> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com>
> wrote:
> >> >>
> >> >> If you don't specify the number of Reducers, Hadoop will use the
> >> >> default
> >> >> -- which, unless you've changed it, is 1.
> >> >>
> >> >> Regards
> >> >>
> >> >> Ian.
> >> >>
> >> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com>
> >> >> wrote:
> >> >>
> >> >> I have implemented secondary sort in my MR job and for some reason
> if i
> >> >> dont specify the number of reducers it uses 1 which doesnt seems
> right
> >> >> because im working with 800M+ records and one reducer slows things
> down
> >> >> significantly. Is this some kind of limitation with the secondary
> sort
> >> >> that
> >> >> it has to use a single reducer .. that kind of would defeat the
> purpose
> >> >> of
> >> >> having a scalable solution such as secondary sort. I would appreciate
> >> >> any
> >> >> help.
> >> >>
> >> >> Thanks
> >> >> Adeel
> >> >>
> >> >>
> >> >>
> >> >> ---
> >> >> Ian Wrigley
> >> >> Sr. Curriculum Manager
> >> >> Cloudera, Inc
> >> >> Cell: (323) 819 4075
> >> >>
> >> >
> >
> >
>
>
>
>
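Yong's RawComparator suggestion (quoted above) amounts to comparing serialized key bytes directly, without deserializing each key into a String and re-splitting it on every comparison. A plain-Java sketch of the underlying idea (self-contained, with hypothetical names; a real Hadoop implementation would extend WritableComparator and account for the Text length header):

```java
import java.nio.charset.StandardCharsets;

// Sketch of the raw-comparison idea: compare only the bytes of the group key
// (everything before '|') without materializing String objects per record.
public class RawGroupKeyCompare {
    static int compareGroupKeys(byte[] a, byte[] b) {
        int la = prefixLen(a), lb = prefixLen(b);
        int n = Math.min(la, lb);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff); // unsigned byte order
            if (d != 0) return d;
        }
        return la - lb; // shorter prefix sorts first when one is a prefix of the other
    }

    static int prefixLen(byte[] key) {
        for (int i = 0; i < key.length; i++) {
            if (key[i] == '|') return i;
        }
        return key.length; // no delimiter: whole key is the group key
    }

    public static void main(String[] args) {
        byte[] x = "Atlanta:GA|Atlanta:GA:1:Adeel".getBytes(StandardCharsets.UTF_8);
        byte[] y = "Boston:MA|Boston:MA:2:Sam".getBytes(StandardCharsets.UTF_8);
        System.out.println(compareGroupKeys(x, y) < 0); // "Atlanta..." before "Boston..."
    }
}
```

Wiring logic like this into the sort and grouping comparators avoids the per-comparison toString()/split() allocations that the job's current comparators pay for 800M+ records.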

Re: secondary sort - number of reducers

Posted by Ravi Kiran <ra...@gmail.com>.
Adeel,
   To add to Yong's  points
a)   Consider tuning the number of threads in reduce tasks and the task
tracker process.  mapred.reduce.parallel.copies
b)   See if the map output can be compressed to ensure there is less IO .
c)   Increase the io.sort.factor to ensure the framework merges a larger
number of files in each merge sort at the reducer
d)   Check the counter "Reduce Shuffle Bytes" of  each reducer to see any
skew of data at few reducers. Try for a even distribution of load through a
better partitioner code.

Regards
Ravi Magham


On Fri, Aug 30, 2013 at 9:28 PM, java8964 java8964 <ja...@hotmail.com>wrote:

> Well, The reducers normally will take much longer than the mappers stage,
> because the copy/shuffle/sort all happened at this time, and they are the
> hard part.
>
> But before we simply say it is part of life, you need to dig into more of
> your MR jobs to find out if you can make it faster.
>
> You are the person most familiar with your data, and you wrote the code to
> group/partition them, and send them to the reducers. Even you set up 255
> reducers, the question is, do each of them get its fair share?
> You need to read the COUNTER information of each reducer, and found out
> how many reducer groups each reducer gets, and how many input bytes it get,
> etc.
>
> Simple example, if you send 200G data, and group them by DATE, if all the
> data belongs to 2 days, and one of them contains 90% of data, then in this
> case, giving 255 reducers won't help, as only 2 reducers will consume data,
> and one of them will consume 90% of data, and will finish in a very long
> time, which WILL delay the whole MR job, while the rest reducers will
> finish within seconds. In this case, maybe you need to rethink what should
> be your key, and make sure each reducer get its fair share of volume of
> data.
>
> After the above fix (in fact, normally it will fix 90% of reducer
> performance problems, especially you have 255 reducer tasks available, so
> each one average will only get 1G data, good for your huge cluster only
> needs to process 256G data :-), if you want to make it even faster, then
> check you code. Do you have to use String.compareTo()? Is it slow?  Google
> hadoop rawcomparator to see if you can do something here.
>
> After that, if you still think the reducer stage slow, check you cluster
> system. Does the reducer spend most time on copy stage, or sort, or in your
> reducer class? Find out the where the time spends, then identify the
> solution.
>
> Yong
>
> ------------------------------
> Date: Fri, 30 Aug 2013 11:02:05 -0400
>
> Subject: Re: secondary sort - number of reducers
> From: adeelmahmood@gmail.com
> To: user@hadoop.apache.org
>
>
>
> my secondary sort on multiple keys seem to work fine with smaller data
> sets but with bigger data sets (like 256 gig and 800M+ records) the mapper
> phase gets done pretty quick (about 15 mins) but then the reducer phase
> seem to take forever. I am using 255 reducers.
>
> basic idea is that my composite key has both group and sort keys in it
> which i parse in the appropriate comparator classes to perform grouping and
> sorting .. my thinking is that mappers is where most of the work is done
> 1. mapper itself (create composite key and value)
> 2. recods sorting
> 3. partiotioner
>
> if all this gets done in 15 mins then reducer has the simple task of
> 1. grouping comparator
> 2. reducer itself (simply output records)
>
> should take less time than mappers .. instead it essentially gets stuck in
> reduce phase .. im gonna paste my code here to see if anything stands out
> as a fundamental design issue
>
> //////PARTITIONER
> public int getPartition(Text key, HCatRecord record, int numReduceTasks) {
> //extract the group key from composite key
>  String groupKey = key.toString().split("\\|")[0];
> return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>  }
>
>
> ////////////GROUP COMAPRATOR
> public int compare(WritableComparable a, WritableComparable b) {
> //compare to text objects
>  String thisGroupKey = ((Text) a).toString().split("\\|")[0];
> String otherGroupKey = ((Text) b).toString().split("\\|")[0];
>  //extract
> return thisGroupKey.compareTo(otherGroupKey);
> }
>
>
> ////////////SORT COMPARATOR
> is similar to group comparator and is in map phase and gets done quick
>
>
>
> //////////REDUCER
> public void reduce(Text key, Iterable<HCatRecord> records, Context
> context) throws IOException, InterruptedException {
> log.info("in reducer for key " + key.toString());
>  Iterator<HCatRecord> recordsIter = records.iterator();
> //we are only interested in the first record after sorting and grouping
>  if(recordsIter.hasNext()){
> HCatRecord rec = recordsIter.next();
> context.write(nw, rec);
>  log.info("returned record >> " + rec.toString());
> }
> }
>
>
> On Fri, Aug 30, 2013 at 9:24 AM, Adeel Qureshi <ad...@gmail.com>wrote:
>
> yup it was negative and by doing this now it seems to be working fine
>
>
> On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma <sh...@gmail.com>wrote:
>
> Is the hash code of that key  is negative.?
> Do something like this
>
> return groupKey.hashCode() & Integer.MAX_VALUE % numParts;
>
> Regards,
> Som Shekhar Sharma
> +91-8197243810
>
>
> On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <ad...@gmail.com>
> wrote:
> > okay so when i specify the number of reducers e.g. in my example i m
> using 4
> > (for a much smaller data set) it works if I use a single column in my
> > composite key .. but if I add multiple columns in the composite key
> > separated by a delimi .. it then throws the illegal partition error (keys
> > before the pipe are group keys and after the pipe are the sort keys and
> my
> > partioner only uses the group keys
> >
> > java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel
> > (-1)
> >         at
> >
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
> >         at
> >
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
> >         at
> >
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
> >         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
> >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >         at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:396)
> >         at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> >         at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >
> >
> > public int getPartition(Text key, HCatRecord record, int numParts) {
> > //extract the group key from composite key
> > String groupKey = key.toString().split("\\|")[0];
> > return groupKey.hashCode() % numParts;
> > }
> >
> >
> > On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <sh...@gmail.com>
> > wrote:
> >>
> >> No...partitionr decides which keys should go to which reducer...and
> >> number of reducers you need to decide...No of reducers depends on
> >> factors like number of key value pair, use case etc
> >> Regards,
> >> Som Shekhar Sharma
> >> +91-8197243810
> >>
> >>
> >> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com>
> >> wrote:
> >> > so it cant figure out an appropriate number of reducers as it does for
> >> > mappers .. in my case hadoop is using 2100+ mappers and then only 1
> >> > reducer
> >> > .. since im overriding the partitioner class shouldnt that decide how
> >> > manyredeucers there should be based on how many different partition
> >> > values
> >> > being returned by the custom partiotioner
> >> >
> >> >
> >> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com>
> wrote:
> >> >>
> >> >> If you don't specify the number of Reducers, Hadoop will use the
> >> >> default
> >> >> -- which, unless you've changed it, is 1.
> >> >>
> >> >> Regards
> >> >>
> >> >> Ian.
> >> >>
> >> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com>
> >> >> wrote:
> >> >>
> >> >> I have implemented secondary sort in my MR job and for some reason
> if i
> >> >> dont specify the number of reducers it uses 1 which doesnt seems
> right
> >> >> because im working with 800M+ records and one reducer slows things
> down
> >> >> significantly. Is this some kind of limitation with the secondary
> sort
> >> >> that
> >> >> it has to use a single reducer .. that kind of would defeat the
> purpose
> >> >> of
> >> >> having a scalable solution such as secondary sort. I would appreciate
> >> >> any
> >> >> help.
> >> >>
> >> >> Thanks
> >> >> Adeel
> >> >>
> >> >>
> >> >>
> >> >> ---
> >> >> Ian Wrigley
> >> >> Sr. Curriculum Manager
> >> >> Cloudera, Inc
> >> >> Cell: (323) 819 4075
> >> >>
> >> >
> >
> >
>
>
>
>

Re: secondary sort - number of reducers

Posted by Ravi Kiran <ra...@gmail.com>.
Adeel,
   To add to Yong's  points
a)   Consider tuning the number of threads in reduce tasks and the task
tracker process.  mapred.reduce.parallel.copies
b)   See if the map output can be compressed to ensure there is less IO .
c)   Increase the io.sort.factor to ensure the framework merges a larger
number of files in each merge sort at the reducer
d)   Check the counter "Reduce Shuffle Bytes" of  each reducer to see any
skew of data at few reducers. Try for a even distribution of load through a
better partitioner code.

Regards
Ravi Magham


On Fri, Aug 30, 2013 at 9:28 PM, java8964 java8964 <ja...@hotmail.com>wrote:

> Well, The reducers normally will take much longer than the mappers stage,
> because the copy/shuffle/sort all happened at this time, and they are the
> hard part.
>
> But before we simply say it is part of life, you need to dig into more of
> your MR jobs to find out if you can make it faster.
>
> You are the person most familiar with your data, and you wrote the code to
> group/partition them, and send them to the reducers. Even you set up 255
> reducers, the question is, do each of them get its fair share?
> You need to read the COUNTER information of each reducer, and found out
> how many reducer groups each reducer gets, and how many input bytes it get,
> etc.
>
> Simple example, if you send 200G data, and group them by DATE, if all the
> data belongs to 2 days, and one of them contains 90% of data, then in this
> case, giving 255 reducers won't help, as only 2 reducers will consume data,
> and one of them will consume 90% of data, and will finish in a very long
> time, which WILL delay the whole MR job, while the rest reducers will
> finish within seconds. In this case, maybe you need to rethink what should
> be your key, and make sure each reducer get its fair share of volume of
> data.
>
> After the above fix (in fact, normally it will fix 90% of reducer
> performance problems, especially you have 255 reducer tasks available, so
> each one average will only get 1G data, good for your huge cluster only
> needs to process 256G data :-), if you want to make it even faster, then
> check you code. Do you have to use String.compareTo()? Is it slow?  Google
> hadoop rawcomparator to see if you can do something here.
>
> After that, if you still think the reducer stage slow, check you cluster
> system. Does the reducer spend most time on copy stage, or sort, or in your
> reducer class? Find out the where the time spends, then identify the
> solution.
>
> Yong
>
> ------------------------------
> Date: Fri, 30 Aug 2013 11:02:05 -0400
>
> Subject: Re: secondary sort - number of reducers
> From: adeelmahmood@gmail.com
> To: user@hadoop.apache.org
>
>
>
> my secondary sort on multiple keys seem to work fine with smaller data
> sets but with bigger data sets (like 256 gig and 800M+ records) the mapper
> phase gets done pretty quick (about 15 mins) but then the reducer phase
> seem to take forever. I am using 255 reducers.
>
> basic idea is that my composite key has both group and sort keys in it
> which i parse in the appropriate comparator classes to perform grouping and
> sorting .. my thinking is that mappers is where most of the work is done
> 1. mapper itself (create composite key and value)
> 2. recods sorting
> 3. partiotioner
>
> if all this gets done in 15 mins then reducer has the simple task of
> 1. grouping comparator
> 2. reducer itself (simply output records)
>
> should take less time than mappers .. instead it essentially gets stuck in
> reduce phase .. im gonna paste my code here to see if anything stands out
> as a fundamental design issue
>
> //////PARTITIONER
> public int getPartition(Text key, HCatRecord record, int numReduceTasks) {
> //extract the group key from composite key
>  String groupKey = key.toString().split("\\|")[0];
> return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>  }
>
>
> ////////////GROUP COMAPRATOR
> public int compare(WritableComparable a, WritableComparable b) {
> //compare to text objects
>  String thisGroupKey = ((Text) a).toString().split("\\|")[0];
> String otherGroupKey = ((Text) b).toString().split("\\|")[0];
>  //extract
> return thisGroupKey.compareTo(otherGroupKey);
> }
>
>
> ////////////SORT COMPARATOR
> is similar to group comparator and is in map phase and gets done quick
>
>
>
> //////////REDUCER
> public void reduce(Text key, Iterable<HCatRecord> records, Context
> context) throws IOException, InterruptedException {
> log.info("in reducer for key " + key.toString());
>  Iterator<HCatRecord> recordsIter = records.iterator();
> //we are only interested in the first record after sorting and grouping
>  if(recordsIter.hasNext()){
> HCatRecord rec = recordsIter.next();
> context.write(nw, rec);
>  log.info("returned record >> " + rec.toString());
> }
> }
>
>
> On Fri, Aug 30, 2013 at 9:24 AM, Adeel Qureshi <ad...@gmail.com>wrote:
>
> yup it was negative and by doing this now it seems to be working fine
>
>
> On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma <sh...@gmail.com>wrote:
>
> Is the hash code of that key  is negative.?
> Do something like this
>
> return groupKey.hashCode() & Integer.MAX_VALUE % numParts;
>
> Regards,
> Som Shekhar Sharma
> +91-8197243810
>
>
> On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <ad...@gmail.com>
> wrote:
> > okay so when i specify the number of reducers e.g. in my example i m
> using 4
> > (for a much smaller data set) it works if I use a single column in my
> > composite key .. but if I add multiple columns in the composite key
> > separated by a delimi .. it then throws the illegal partition error (keys
> > before the pipe are group keys and after the pipe are the sort keys and
> my
> > partioner only uses the group keys
> >
> > java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel
> > (-1)
> >         at
> >
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
> >         at
> >
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
> >         at
> >
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
> >         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
> >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >         at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:396)
> >         at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> >         at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >
> >
> > public int getPartition(Text key, HCatRecord record, int numParts) {
> > //extract the group key from composite key
> > String groupKey = key.toString().split("\\|")[0];
> > return groupKey.hashCode() % numParts;
> > }
> >
> >
> > On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <sh...@gmail.com>
> > wrote:
> >>
> >> No...partitionr decides which keys should go to which reducer...and
> >> number of reducers you need to decide...No of reducers depends on
> >> factors like number of key value pair, use case etc
> >> Regards,
> >> Som Shekhar Sharma
> >> +91-8197243810
> >>
> >>
> >> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com>
> >> wrote:
> >> > so it cant figure out an appropriate number of reducers as it does for
> >> > mappers .. in my case hadoop is using 2100+ mappers and then only 1
> >> > reducer
> >> > .. since im overriding the partitioner class shouldnt that decide how
> >> > manyredeucers there should be based on how many different partition
> >> > values
> >> > being returned by the custom partiotioner
> >> >
> >> >
> >> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com>
> wrote:
> >> >>
> >> >> If you don't specify the number of Reducers, Hadoop will use the
> >> >> default
> >> >> -- which, unless you've changed it, is 1.
> >> >>
> >> >> Regards
> >> >>
> >> >> Ian.
> >> >>
> >> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com>
> >> >> wrote:
> >> >>
> >> >> I have implemented secondary sort in my MR job and for some reason
> if i
> >> >> dont specify the number of reducers it uses 1 which doesnt seems
> right
> >> >> because im working with 800M+ records and one reducer slows things
> down
> >> >> significantly. Is this some kind of limitation with the secondary
> sort
> >> >> that
> >> >> it has to use a single reducer .. that kind of would defeat the
> purpose
> >> >> of
> >> >> having a scalable solution such as secondary sort. I would appreciate
> >> >> any
> >> >> help.
> >> >>
> >> >> Thanks
> >> >> Adeel
> >> >>
> >> >>
> >> >>
> >> >> ---
> >> >> Ian Wrigley
> >> >> Sr. Curriculum Manager
> >> >> Cloudera, Inc
> >> >> Cell: (323) 819 4075
> >> >>
> >> >
> >
> >
>
>
>
>

Re: secondary sort - number of reducers

Posted by Ravi Kiran <ra...@gmail.com>.
Adeel,
   To add to Yong's points:
a)   Consider tuning the number of copier threads in the reduce tasks and the task
tracker process (mapred.reduce.parallel.copies).
b)   See if the map output can be compressed, to ensure there is less IO.
c)   Increase io.sort.factor so that the framework merges a larger
number of files in each merge pass at the reducer.
d)   Check the "Reduce Shuffle Bytes" counter of each reducer to spot any
skew of data toward a few reducers. Aim for an even distribution of load through
better partitioner code.
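A sketch of how tips (a) to (c) map onto the job configuration, using the MR1-era property names that were current when this thread was written (they were renamed in later releases); the class name and the values are illustrative only, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TuningSketch {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapred.reduce.parallel.copies", 10);      // (a) parallel copier threads per reduce task
        conf.setBoolean("mapred.compress.map.output", true);   // (b) compress map output to cut shuffle IO
        conf.setInt("io.sort.factor", 50);                     // (c) merge more spill files per merge pass
        return new Job(conf, "secondary-sort");
    }
}
```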

Regards
Ravi Magham


On Fri, Aug 30, 2013 at 9:28 PM, java8964 java8964 <ja...@hotmail.com>wrote:

> Well, The reducers normally will take much longer than the mappers stage,
> because the copy/shuffle/sort all happened at this time, and they are the
> hard part.
>
> But before we simply say it is part of life, you need to dig into more of
> your MR jobs to find out if you can make it faster.
>
> You are the person most familiar with your data, and you wrote the code to
> group/partition them, and send them to the reducers. Even you set up 255
> reducers, the question is, do each of them get its fair share?
> You need to read the COUNTER information of each reducer, and found out
> how many reducer groups each reducer gets, and how many input bytes it get,
> etc.
>
> Simple example, if you send 200G data, and group them by DATE, if all the
> data belongs to 2 days, and one of them contains 90% of data, then in this
> case, giving 255 reducers won't help, as only 2 reducers will consume data,
> and one of them will consume 90% of data, and will finish in a very long
> time, which WILL delay the whole MR job, while the rest reducers will
> finish within seconds. In this case, maybe you need to rethink what should
> be your key, and make sure each reducer get its fair share of volume of
> data.
>
> After the above fix (in fact, normally it will fix 90% of reducer
> performance problems, especially you have 255 reducer tasks available, so
> each one average will only get 1G data, good for your huge cluster only
> needs to process 256G data :-), if you want to make it even faster, then
> check you code. Do you have to use String.compareTo()? Is it slow?  Google
> hadoop rawcomparator to see if you can do something here.
>
> After that, if you still think the reducer stage slow, check you cluster
> system. Does the reducer spend most time on copy stage, or sort, or in your
> reducer class? Find out the where the time spends, then identify the
> solution.
>
> Yong
>
> ------------------------------
> Date: Fri, 30 Aug 2013 11:02:05 -0400
>
> Subject: Re: secondary sort - number of reducers
> From: adeelmahmood@gmail.com
> To: user@hadoop.apache.org
>
>
>
> my secondary sort on multiple keys seem to work fine with smaller data
> sets but with bigger data sets (like 256 gig and 800M+ records) the mapper
> phase gets done pretty quick (about 15 mins) but then the reducer phase
> seem to take forever. I am using 255 reducers.
>
> basic idea is that my composite key has both group and sort keys in it
> which i parse in the appropriate comparator classes to perform grouping and
> sorting .. my thinking is that mappers is where most of the work is done
> 1. mapper itself (create composite key and value)
> 2. recods sorting
> 3. partiotioner
>
> if all this gets done in 15 mins then reducer has the simple task of
> 1. grouping comparator
> 2. reducer itself (simply output records)
>
> should take less time than mappers .. instead it essentially gets stuck in
> reduce phase .. im gonna paste my code here to see if anything stands out
> as a fundamental design issue
>
> //////PARTITIONER
> public int getPartition(Text key, HCatRecord record, int numReduceTasks) {
> //extract the group key from composite key
>  String groupKey = key.toString().split("\\|")[0];
> return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>  }
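A quick standalone check of the fix quoted above (plain Java, no Hadoop needed; the class name is made up for illustration). Java's `%` keeps the sign of the dividend, and `Math.abs` is not a safe alternative because `abs(Integer.MIN_VALUE)` overflows and stays negative:

```java
public class PartitionSignDemo {
    // Mirrors the fixed partitioner: clear the sign bit, then take the modulus.
    public static int partition(int hash, int numParts) {
        return (hash & Integer.MAX_VALUE) % numParts;
    }

    public static void main(String[] args) {
        int numParts = 7;

        // A negative hashCode yields a negative (illegal) partition number.
        System.out.println(-5 % numParts);                          // -5

        // Math.abs overflows on Integer.MIN_VALUE and stays negative.
        System.out.println(Math.abs(Integer.MIN_VALUE));            // -2147483648

        // Clearing the sign bit always gives a value in [0, numParts).
        System.out.println(partition(-5, numParts));                // 4
        System.out.println(partition(Integer.MIN_VALUE, numParts)); // 0

        // Note the parentheses: % binds tighter than & in Java, so
        // "hash & Integer.MAX_VALUE % numParts" would compute
        // hash & (Integer.MAX_VALUE % numParts) instead.
    }
}
```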
>
>
> ////////////GROUP COMPARATOR
> public int compare(WritableComparable a, WritableComparable b) {
> //compare to text objects
>  String thisGroupKey = ((Text) a).toString().split("\\|")[0];
> String otherGroupKey = ((Text) b).toString().split("\\|")[0];
>  //extract
> return thisGroupKey.compareTo(otherGroupKey);
> }
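The group comparator quoted above calls toString().split("\\|") on every comparison, which compiles a regex and allocates new Strings in the sort's hot path. A sketch, in plain Java, of comparing the group-key prefix character by character instead (the class and method names here are made up for illustration; the bigger win Yong mentions is a byte-level RawComparator that avoids deserializing the Text at all):

```java
public class GroupKeyCompare {
    // Compare only the prefix before '|' (the group key), without the
    // regex matching and String allocation that split("\\|") incurs.
    public static int compareGroupKeys(String a, String b) {
        int ia = a.indexOf('|'); if (ia < 0) ia = a.length();
        int ib = b.indexOf('|'); if (ib < 0) ib = b.length();
        int n = Math.min(ia, ib);
        for (int i = 0; i < n; i++) {
            int d = a.charAt(i) - b.charAt(i);
            if (d != 0) return d;
        }
        // Equal common prefix: the shorter group key sorts first.
        return ia - ib;
    }

    public static void main(String[] args) {
        // Same group key, different sort keys -> equal for grouping purposes.
        System.out.println(compareGroupKeys("Atlanta:GA|1:Adeel", "Atlanta:GA|2:Bob"));
    }
}
```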
>
>
> ////////////SORT COMPARATOR
> is similar to group comparator and is in map phase and gets done quick
>
>
>
> //////////REDUCER
> public void reduce(Text key, Iterable<HCatRecord> records, Context
> context) throws IOException, InterruptedException {
> log.info("in reducer for key " + key.toString());
>  Iterator<HCatRecord> recordsIter = records.iterator();
> //we are only interested in the first record after sorting and grouping
>  if(recordsIter.hasNext()){
> HCatRecord rec = recordsIter.next();
> context.write(nw, rec);
>  log.info("returned record >> " + rec.toString());
> }
> }
>
>
> On Fri, Aug 30, 2013 at 9:24 AM, Adeel Qureshi <ad...@gmail.com>wrote:
>
> yup it was negative and by doing this now it seems to be working fine
>
>
> On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma <sh...@gmail.com>wrote:
>
> Is the hash code of that key  is negative.?
> Do something like this
>
> return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;
>
> Regards,
> Som Shekhar Sharma
> +91-8197243810
>
>
> On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <ad...@gmail.com>
> wrote:
> > okay so when i specify the number of reducers e.g. in my example i m
> using 4
> > (for a much smaller data set) it works if I use a single column in my
> > composite key .. but if I add multiple columns in the composite key
> > separated by a delimi .. it then throws the illegal partition error (keys
> > before the pipe are group keys and after the pipe are the sort keys and
> my
> > partioner only uses the group keys
> >
> > java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel
> > (-1)
> >         at
> >
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
> >         at
> >
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
> >         at
> >
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
> >         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
> >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >         at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:396)
> >         at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> >         at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >
> >
> > public int getPartition(Text key, HCatRecord record, int numParts) {
> > //extract the group key from composite key
> > String groupKey = key.toString().split("\\|")[0];
> > return groupKey.hashCode() % numParts;
> > }
> >
> >
> > On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <sh...@gmail.com>
> > wrote:
> >>
> >> No...partitionr decides which keys should go to which reducer...and
> >> number of reducers you need to decide...No of reducers depends on
> >> factors like number of key value pair, use case etc
> >> Regards,
> >> Som Shekhar Sharma
> >> +91-8197243810
> >>
> >>
> >> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com>
> >> wrote:
> >> > so it cant figure out an appropriate number of reducers as it does for
> >> > mappers .. in my case hadoop is using 2100+ mappers and then only 1
> >> > reducer
> >> > .. since im overriding the partitioner class shouldnt that decide how
> >> > manyredeucers there should be based on how many different partition
> >> > values
> >> > being returned by the custom partiotioner
> >> >
> >> >
> >> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com>
> wrote:
> >> >>
> >> >> If you don't specify the number of Reducers, Hadoop will use the
> >> >> default
> >> >> -- which, unless you've changed it, is 1.
> >> >>
> >> >> Regards
> >> >>
> >> >> Ian.
> >> >>
> >> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com>
> >> >> wrote:
> >> >>
> >> >> I have implemented secondary sort in my MR job and for some reason
> if i
> >> >> dont specify the number of reducers it uses 1 which doesnt seems
> right
> >> >> because im working with 800M+ records and one reducer slows things
> down
> >> >> significantly. Is this some kind of limitation with the secondary
> sort
> >> >> that
> >> >> it has to use a single reducer .. that kind of would defeat the
> purpose
> >> >> of
> >> >> having a scalable solution such as secondary sort. I would appreciate
> >> >> any
> >> >> help.
> >> >>
> >> >> Thanks
> >> >> Adeel
> >> >>
> >> >>
> >> >>
> >> >> ---
> >> >> Ian Wrigley
> >> >> Sr. Curriculum Manager
> >> >> Cloudera, Inc
> >> >> Cell: (323) 819 4075
> >> >>
> >> >
> >
> >
>
>
>
>

RE: secondary sort - number of reducers

Posted by java8964 java8964 <ja...@hotmail.com>.
Well, the reducers normally take much longer than the map stage, because the copy/shuffle/sort all happen at this time, and they are the hard part.

But before we simply say it is part of life, you need to dig into your MR job to find out if you can make it faster.

You are the person most familiar with your data, and you wrote the code to group/partition it and send it to the reducers. Even if you set up 255 reducers, the question is: does each of them get its fair share? You need to read the COUNTER information of each reducer, and find out how many reduce groups each reducer gets, how many input bytes it gets, etc.

Simple example: if you send 200G of data and group it by DATE, and all the data belongs to 2 days with one day holding 90% of it, then giving the job 255 reducers won't help. Only 2 reducers will consume data, and one of them will consume 90% of it and take a very long time, which WILL delay the whole MR job, while the rest of the reducers finish within seconds. In this case, maybe you need to rethink what your key should be, and make sure each reducer gets its fair share of the data volume.

After the above fix (in fact, it normally fixes 90% of reducer performance problems, especially since you have 255 reducer tasks available, so each one on average only gets 1G of data, good for a cluster that only needs to process 256G :-), if you want to make it even faster, check your code. Do you have to use String.compareTo()? Is it slow? Google "hadoop rawcomparator" to see if you can do something here.

After that, if you still think the reducer stage is slow, check your cluster. Does the reducer spend most of its time in the copy stage, the sort, or in your reducer class? Find out where the time goes, then identify the solution.
Yong
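To make the skew point concrete, here is a small standalone simulation (plain Java, no Hadoop; the class name is made up, and the partition function mirrors the one from this thread): 1000 records grouped by date, 90% on one day, fed through a hash partitioner with 255 reducers. Only two partitions ever receive data.

```java
import java.util.HashMap;
import java.util.Map;

public class SkewCheck {
    // Same partitioning scheme as the thread's custom partitioner.
    public static int partition(String groupKey, int numParts) {
        return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    // Count how many records each reducer would receive when 90% of the
    // records carry one group key and 10% carry another.
    public static Map<Integer, Integer> simulate(int records, int numParts) {
        Map<Integer, Integer> load = new HashMap<>();
        for (int i = 0; i < records; i++) {
            String key = (i < records * 9 / 10) ? "2013-08-29" : "2013-08-30";
            load.merge(partition(key, numParts), 1, Integer::sum);
        }
        return load;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> load = simulate(1000, 255);
        // Only 2 of 255 reducers get any input; one gets 9x the other's share.
        System.out.println(load.size() + " of 255 reducers used: " + load);
    }
}
```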

Date: Fri, 30 Aug 2013 11:02:05 -0400
Subject: Re: secondary sort - number of reducers
From: adeelmahmood@gmail.com
To: user@hadoop.apache.org



my secondary sort on multiple keys seem to work fine with smaller data sets but with bigger data sets (like 256 gig and 800M+ records) the mapper phase gets done pretty quick (about 15 mins) but then the reducer phase seem to take forever. I am using 255 reducers.

basic idea is that my composite key has both group and sort keys in it which i parse in the appropriate comparator classes to perform grouping and sorting .. my thinking is that mappers is where most of the work is done
1. mapper itself (create composite key and value)
2. records sorting
3. partitioner

if all this gets done in 15 mins then reducer has the simple task of
1. grouping comparator
2. reducer itself (simply output records)

should take less time than mappers .. instead it essentially gets stuck in reduce phase .. im gonna paste my code here to see if anything stands out as a fundamental design issue

//////PARTITIONER
public int getPartition(Text key, HCatRecord record, int numReduceTasks) {
	//extract the group key from composite key
	String groupKey = key.toString().split("\\|")[0];
	return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

////////////GROUP COMPARATOR
public int compare(WritableComparable a, WritableComparable b) {
	//compare two text objects
	String thisGroupKey = ((Text) a).toString().split("\\|")[0];
	String otherGroupKey = ((Text) b).toString().split("\\|")[0];
	return thisGroupKey.compareTo(otherGroupKey);
}

////////////SORT COMPARATOR
is similar to group comparator and is in map phase and gets done quick

//////////REDUCER
public void reduce(Text key, Iterable<HCatRecord> records, Context context) throws IOException, InterruptedException {
	log.info("in reducer for key " + key.toString());
	Iterator<HCatRecord> recordsIter = records.iterator();
	//we are only interested in the first record after sorting and grouping
	if (recordsIter.hasNext()) {
		HCatRecord rec = recordsIter.next();
		context.write(nw, rec);
		log.info("returned record >> " + rec.toString());
	}
}


On Fri, Aug 30, 2013 at 9:24 AM, Adeel Qureshi <ad...@gmail.com> wrote:

yup it was negative and by doing this now it seems to be working fine


On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma <sh...@gmail.com> wrote:


Is the hash code of that key  is negative.?

Do something like this



return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;



Regards,

Som Shekhar Sharma

+91-8197243810





On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <ad...@gmail.com> wrote:

> okay so when i specify the number of reducers e.g. in my example i m using 4

> (for a much smaller data set) it works if I use a single column in my

> composite key .. but if I add multiple columns in the composite key

> separated by a delimi .. it then throws the illegal partition error (keys

> before the pipe are group keys and after the pipe are the sort keys and my

> partioner only uses the group keys

>

> java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel

> (-1)

>         at

> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)

>         at

> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)

>         at

> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)

>         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)

>         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)

>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)

>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)

>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)

>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)

>         at java.security.AccessController.doPrivileged(Native Method)

>         at javax.security.auth.Subject.doAs(Subject.java:396)

>         at

> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)

>         at org.apache.hadoop.mapred.Child.main(Child.java:249)

>

>

> public int getPartition(Text key, HCatRecord record, int numParts) {

> //extract the group key from composite key

> String groupKey = key.toString().split("\\|")[0];

> return groupKey.hashCode() % numParts;

> }

>

>

> On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <sh...@gmail.com>

> wrote:

>>

>> No...partitionr decides which keys should go to which reducer...and

>> number of reducers you need to decide...No of reducers depends on

>> factors like number of key value pair, use case etc

>> Regards,

>> Som Shekhar Sharma

>> +91-8197243810

>>

>>

>> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com>

>> wrote:

>> > so it cant figure out an appropriate number of reducers as it does for

>> > mappers .. in my case hadoop is using 2100+ mappers and then only 1

>> > reducer

>> > .. since im overriding the partitioner class shouldnt that decide how

>> > manyredeucers there should be based on how many different partition

>> > values

>> > being returned by the custom partiotioner

>> >

>> >

>> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com> wrote:

>> >>

>> >> If you don't specify the number of Reducers, Hadoop will use the

>> >> default

>> >> -- which, unless you've changed it, is 1.

>> >>

>> >> Regards

>> >>

>> >> Ian.

>> >>

>> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com>

>> >> wrote:

>> >>

>> >> I have implemented secondary sort in my MR job and for some reason if i

>> >> dont specify the number of reducers it uses 1 which doesnt seems right

>> >> because im working with 800M+ records and one reducer slows things down

>> >> significantly. Is this some kind of limitation with the secondary sort

>> >> that

>> >> it has to use a single reducer .. that kind of would defeat the purpose

>> >> of

>> >> having a scalable solution such as secondary sort. I would appreciate

>> >> any

>> >> help.

>> >>

>> >> Thanks

>> >> Adeel

>> >>

>> >>

>> >>

>> >> ---

>> >> Ian Wrigley

>> >> Sr. Curriculum Manager

>> >> Cloudera, Inc

>> >> Cell: (323) 819 4075

>> >>

>> >

>

>




 		 	   		  

RE: secondary sort - number of reducers

Posted by java8964 java8964 <ja...@hotmail.com>.
Well, The reducers normally will take much longer than the mappers stage, because the copy/shuffle/sort all happened at this time, and they are the hard part.
But before we simply say it is part of life, you need to dig into more of your MR jobs to find out if you can make it faster.
You are the person most familiar with your data, and you wrote the code to group/partition them, and send them to the reducers. Even you set up 255 reducers, the question is, do each of them get its fair share?You need to read the COUNTER information of each reducer, and found out how many reducer groups each reducer gets, and how many input bytes it get, etc.
Simple example, if you send 200G data, and group them by DATE, if all the data belongs to 2 days, and one of them contains 90% of data, then in this case, giving 255 reducers won't help, as only 2 reducers will consume data, and one of them will consume 90% of data, and will finish in a very long time, which WILL delay the whole MR job, while the rest reducers will finish within seconds. In this case, maybe you need to rethink what should be your key, and make sure each reducer get its fair share of volume of data.
After the above fix (in fact, normally it will fix 90% of reducer performance problems, especially you have 255 reducer tasks available, so each one average will only get 1G data, good for your huge cluster only needs to process 256G data :-), if you want to make it even faster, then check you code. Do you have to use String.compareTo()? Is it slow?  Google hadoop rawcomparator to see if you can do something here.
After that, if you still think the reducer stage slow, check you cluster system. Does the reducer spend most time on copy stage, or sort, or in your reducer class? Find out the where the time spends, then identify the solution.
Yong

Date: Fri, 30 Aug 2013 11:02:05 -0400
Subject: Re: secondary sort - number of reducers
From: adeelmahmood@gmail.com
To: user@hadoop.apache.org



my secondary sort on multiple keys seem to work fine with smaller data sets but with bigger data sets (like 256 gig and 800M+ records) the mapper phase gets done pretty quick (about 15 mins) but then the reducer phase seem to take forever. I am using 255 reducers.

basic idea is that my composite key has both group and sort keys in it which i parse in the appropriate comparator classes to perform grouping and sorting .. my thinking is that mappers is where most of the work is done 
1. mapper itself (create composite key and value)2. recods sorting3. partiotioner
if all this gets done in 15 mins then reducer has the simple task of1. grouping comparator
2. reducer itself (simply output records)
should take less time than mappers .. instead it essentially gets stuck in reduce phase .. im gonna paste my code here to see if anything stands out as a fundamental design issue

//////PARTITIONERpublic int getPartition(Text key, HCatRecord record, int numReduceTasks) {		//extract the group key from composite key
		String groupKey = key.toString().split("\\|")[0];				return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
	}

////////////GROUP COMAPRATORpublic int compare(WritableComparable a, WritableComparable b) {		//compare to text objects
		String thisGroupKey = ((Text) a).toString().split("\\|")[0];		String otherGroupKey = ((Text) b).toString().split("\\|")[0];
		//extract 		return thisGroupKey.compareTo(otherGroupKey);	}


////////////SORT COMPARATOR is similar to group comparator and is in map phase and gets done quick 


//////////REDUCER
public void reduce(Text key, Iterable<HCatRecord> records, Context context) throws IOException, InterruptedException {		log.info("in reducer for key " + key.toString());
		Iterator<HCatRecord> recordsIter = records.iterator();		//we are only interested in the first record after sorting and grouping
		if(recordsIter.hasNext()){			HCatRecord rec = recordsIter.next();			context.write(nw, rec);
			log.info("returned record >> " + rec.toString());		}	}


On Fri, Aug 30, 2013 at 9:24 AM, Adeel Qureshi <ad...@gmail.com> wrote:

yup it was negative and by doing this now it seems to be working fine


On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma <sh...@gmail.com> wrote:


Is the hash code of that key  is negative.?

Do something like this



return groupKey.hashCode() & Integer.MAX_VALUE % numParts;



Regards,

Som Shekhar Sharma

+91-8197243810





On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <ad...@gmail.com> wrote:

> okay so when i specify the number of reducers e.g. in my example i m using 4

> (for a much smaller data set) it works if I use a single column in my

> composite key .. but if I add multiple columns in the composite key

> separated by a delimi .. it then throws the illegal partition error (keys

> before the pipe are group keys and after the pipe are the sort keys and my

> partioner only uses the group keys

>

> java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel

> (-1)

>         at

> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)

>         at

> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)

>         at

> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)

>         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)

>         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)

>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)

>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)

>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)

>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)

>         at java.security.AccessController.doPrivileged(Native Method)

>         at javax.security.auth.Subject.doAs(Subject.java:396)

>         at

> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)

>         at org.apache.hadoop.mapred.Child.main(Child.java:249)

>

>

> public int getPartition(Text key, HCatRecord record, int numParts) {

> //extract the group key from composite key

> String groupKey = key.toString().split("\\|")[0];

> return groupKey.hashCode() % numParts;

> }

>

>

> On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <sh...@gmail.com>

> wrote:

>>

>> No...partitionr decides which keys should go to which reducer...and

>> number of reducers you need to decide...No of reducers depends on

>> factors like number of key value pair, use case etc

>> Regards,

>> Som Shekhar Sharma

>> +91-8197243810

>>

>>

>> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com>

>> wrote:

>> > so it cant figure out an appropriate number of reducers as it does for

>> > mappers .. in my case hadoop is using 2100+ mappers and then only 1

>> > reducer

>> > .. since im overriding the partitioner class shouldnt that decide how

>> > manyredeucers there should be based on how many different partition

>> > values

>> > being returned by the custom partiotioner

>> >

>> >

>> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com> wrote:

>> >>

>> >> If you don't specify the number of Reducers, Hadoop will use the

>> >> default

>> >> -- which, unless you've changed it, is 1.

>> >>

>> >> Regards

>> >>

>> >> Ian.

>> >>

>> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com>

>> >> wrote:

>> >>

>> >> I have implemented secondary sort in my MR job and for some reason if i

>> >> dont specify the number of reducers it uses 1 which doesnt seems right

>> >> because im working with 800M+ records and one reducer slows things down

>> >> significantly. Is this some kind of limitation with the secondary sort

>> >> that

>> >> it has to use a single reducer .. that kind of would defeat the purpose

>> >> of

>> >> having a scalable solution such as secondary sort. I would appreciate

>> >> any

>> >> help.

>> >>

>> >> Thanks

>> >> Adeel

>> >>

>> >>

>> >>

>> >> ---

>> >> Ian Wrigley

>> >> Sr. Curriculum Manager

>> >> Cloudera, Inc

>> >> Cell: (323) 819 4075

>> >>

>> >

>

>




 		 	   		  

RE: secondary sort - number of reducers

Posted by java8964 java8964 <ja...@hotmail.com>.
Well, The reducers normally will take much longer than the mappers stage, because the copy/shuffle/sort all happened at this time, and they are the hard part.
But before we simply say it is part of life, you need to dig into more of your MR jobs to find out if you can make it faster.
You are the person most familiar with your data, and you wrote the code to group/partition them, and send them to the reducers. Even you set up 255 reducers, the question is, do each of them get its fair share?You need to read the COUNTER information of each reducer, and found out how many reducer groups each reducer gets, and how many input bytes it get, etc.
Simple example, if you send 200G data, and group them by DATE, if all the data belongs to 2 days, and one of them contains 90% of data, then in this case, giving 255 reducers won't help, as only 2 reducers will consume data, and one of them will consume 90% of data, and will finish in a very long time, which WILL delay the whole MR job, while the rest reducers will finish within seconds. In this case, maybe you need to rethink what should be your key, and make sure each reducer get its fair share of volume of data.
After the above fix (in fact, normally it will fix 90% of reducer performance problems, especially you have 255 reducer tasks available, so each one average will only get 1G data, good for your huge cluster only needs to process 256G data :-), if you want to make it even faster, then check you code. Do you have to use String.compareTo()? Is it slow?  Google hadoop rawcomparator to see if you can do something here.
After that, if you still think the reducer stage slow, check you cluster system. Does the reducer spend most time on copy stage, or sort, or in your reducer class? Find out the where the time spends, then identify the solution.
Yong

Date: Fri, 30 Aug 2013 11:02:05 -0400
Subject: Re: secondary sort - number of reducers
From: adeelmahmood@gmail.com
To: user@hadoop.apache.org



my secondary sort on multiple keys seem to work fine with smaller data sets but with bigger data sets (like 256 gig and 800M+ records) the mapper phase gets done pretty quick (about 15 mins) but then the reducer phase seem to take forever. I am using 255 reducers.

basic idea is that my composite key has both group and sort keys in it which i parse in the appropriate comparator classes to perform grouping and sorting .. my thinking is that mappers is where most of the work is done 
1. mapper itself (create composite key and value)2. recods sorting3. partiotioner
if all this gets done in 15 mins then reducer has the simple task of1. grouping comparator
2. reducer itself (simply output records)
should take less time than mappers .. instead it essentially gets stuck in reduce phase .. im gonna paste my code here to see if anything stands out as a fundamental design issue

//////PARTITIONERpublic int getPartition(Text key, HCatRecord record, int numReduceTasks) {		//extract the group key from composite key
		String groupKey = key.toString().split("\\|")[0];				return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
	}

////////////GROUP COMAPRATORpublic int compare(WritableComparable a, WritableComparable b) {		//compare to text objects
		String thisGroupKey = ((Text) a).toString().split("\\|")[0];		String otherGroupKey = ((Text) b).toString().split("\\|")[0];
		//extract 		return thisGroupKey.compareTo(otherGroupKey);	}


////////////SORT COMPARATOR is similar to group comparator and is in map phase and gets done quick 


//////////REDUCER
public void reduce(Text key, Iterable<HCatRecord> records, Context context) throws IOException, InterruptedException {		log.info("in reducer for key " + key.toString());
		Iterator<HCatRecord> recordsIter = records.iterator();		//we are only interested in the first record after sorting and grouping
		if(recordsIter.hasNext()){			HCatRecord rec = recordsIter.next();			context.write(nw, rec);
			log.info("returned record >> " + rec.toString());		}	}


On Fri, Aug 30, 2013 at 9:24 AM, Adeel Qureshi <ad...@gmail.com> wrote:

yup it was negative and by doing this now it seems to be working fine


On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma <sh...@gmail.com> wrote:


Is the hash code of that key  is negative.?

Do something like this



return groupKey.hashCode() & Integer.MAX_VALUE % numParts;



Regards,

Som Shekhar Sharma


RE: secondary sort - number of reducers

Posted by java8964 java8964 <ja...@hotmail.com>.
Well, the reducer stage normally takes much longer than the map stage, because the copy/shuffle/sort all happen there, and that is the hard part.
But before we simply say it is part of life, you need to dig deeper into your MR job to find out whether you can make it faster.
You are the person most familiar with your data, and you wrote the code that groups/partitions it and sends it to the reducers. Even though you set up 255 reducers, the question is: does each of them get its fair share? You need to read the COUNTER information of each reducer and find out how many reduce groups each reducer gets, how many input bytes it gets, etc.
Simple example: if you send 200G of data and group it by DATE, and all the data belongs to 2 days, with one day containing 90% of the data, then giving the job 255 reducers won't help. Only 2 reducers will consume data, and one of them will consume 90% of it and take a very long time, which WILL delay the whole MR job, while the rest of the reducers finish within seconds. In that case you need to rethink your key and make sure each reducer gets a fair share of the data volume.
After the above fix (in fact, it normally fixes 90% of reducer performance problems, especially since you have 255 reducer tasks available, so each one will on average only get about 1G of your 256G of data), if you want to make it even faster, check your code. Do you have to use String.compareTo()? Is it slow? Google "hadoop rawcomparator" to see if you can do something here.
After that, if you still think the reducer stage is slow, check your cluster. Does the reducer spend most of its time in the copy stage, the sort stage, or in your reducer class? Find out where the time goes, then identify the solution.
Yong
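
To make the sign-bit point concrete, here is a minimal, self-contained Java sketch (plain Java, no Hadoop classes; the PartitionDemo name and the sample key are made up for illustration) showing why a bare hashCode() % numReduceTasks can go negative and why masking with Integer.MAX_VALUE is safer than Math.abs:

```java
public class PartitionDemo {
    // Mirrors Hadoop's HashPartitioner trick: clearing the sign bit with
    // Integer.MAX_VALUE guarantees a non-negative partition number.
    static int partition(int hash, int numReduceTasks) {
        return (hash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // String.hashCode() can legitimately be negative, and Java's %
        // keeps the sign of the dividend, so a bare hash % numReduceTasks
        // can return a negative partition -- the "Illegal partition"
        // error seen earlier in this thread.
        int h = "Atlanta:GA".hashCode();
        System.out.println(h % 4);           // may be negative
        System.out.println(partition(h, 4)); // always in [0, 3]

        // Math.abs is NOT a safe alternative: abs(Integer.MIN_VALUE)
        // overflows and is still negative.
        System.out.println(Math.abs(Integer.MIN_VALUE)); // -2147483648
        System.out.println(partition(Integer.MIN_VALUE, 4)); // 0
    }
}
```

This is also why the HashPartitioner source quoted earlier in the thread uses the & mask rather than Math.abs.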

Date: Fri, 30 Aug 2013 11:02:05 -0400
Subject: Re: secondary sort - number of reducers
From: adeelmahmood@gmail.com
To: user@hadoop.apache.org



my secondary sort on multiple keys seem to work fine with smaller data sets but with bigger data sets (like 256 gig and 800M+ records) the mapper phase gets done pretty quick (about 15 mins) but then the reducer phase seem to take forever. I am using 255 reducers.

basic idea is that my composite key has both group and sort keys in it which i parse in the appropriate comparator classes to perform grouping and sorting .. my thinking is that mappers is where most of the work is done
1. mapper itself (create composite key and value)
2. recods sorting
3. partiotioner

if all this gets done in 15 mins then reducer has the simple task of
1. grouping comparator
2. reducer itself (simply output records)

should take less time than mappers .. instead it essentially gets stuck in reduce phase .. im gonna paste my code here to see if anything stands out as a fundamental design issue

//////PARTITIONER
public int getPartition(Text key, HCatRecord record, int numReduceTasks) {
	//extract the group key from composite key
	String groupKey = key.toString().split("\\|")[0];
	return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

////////////GROUP COMAPRATOR
public int compare(WritableComparable a, WritableComparable b) {
	//compare to text objects
	String thisGroupKey = ((Text) a).toString().split("\\|")[0];
	String otherGroupKey = ((Text) b).toString().split("\\|")[0];
	//extract
	return thisGroupKey.compareTo(otherGroupKey);
}

////////////SORT COMPARATOR is similar to group comparator and is in map phase and gets done quick

//////////REDUCER
public void reduce(Text key, Iterable<HCatRecord> records, Context context) throws IOException, InterruptedException {
	log.info("in reducer for key " + key.toString());
	Iterator<HCatRecord> recordsIter = records.iterator();
	//we are only interested in the first record after sorting and grouping
	if(recordsIter.hasNext()){
		HCatRecord rec = recordsIter.next();
		context.write(nw, rec);
		log.info("returned record >> " + rec.toString());
	}
}

Re: secondary sort - number of reducers

Posted by Adeel Qureshi <ad...@gmail.com>.
my secondary sort on multiple keys seems to work fine with smaller data sets
but with bigger data sets (like 256 gig and 800M+ records) the mapper phase
gets done pretty quick (about 15 mins) but then the reducer phase seems to
take forever. I am using 255 reducers.

basic idea is that my composite key has both group and sort keys in it,
which I parse in the appropriate comparator classes to perform grouping and
sorting .. my thinking is that the mappers are where most of the work is done:
1. mapper itself (create composite key and value)
2. records sorting
3. partitioner

if all this gets done in 15 mins then the reducer has the simple task of:
1. grouping comparator
2. reducer itself (simply output records)

and should take less time than the mappers .. instead it essentially gets stuck
in the reduce phase .. im gonna paste my code here to see if anything stands out
as a fundamental design issue

//////PARTITIONER
public int getPartition(Text key, HCatRecord record, int numReduceTasks) {
    // extract the group key from the composite key
    String groupKey = key.toString().split("\\|")[0];
    // mask the sign bit so the partition number is always non-negative
    return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}


////////////GROUP COMPARATOR
public int compare(WritableComparable a, WritableComparable b) {
    // extract the group key from each composite key and compare the two
    String thisGroupKey = ((Text) a).toString().split("\\|")[0];
    String otherGroupKey = ((Text) b).toString().split("\\|")[0];
    return thisGroupKey.compareTo(otherGroupKey);
}


////////////SORT COMPARATOR
is similar to the group comparator, runs in the map phase, and gets done quick



//////////REDUCER
public void reduce(Text key, Iterable<HCatRecord> records, Context context)
        throws IOException, InterruptedException {
    log.info("in reducer for key " + key.toString());
    Iterator<HCatRecord> recordsIter = records.iterator();
    // we are only interested in the first record after sorting and grouping
    if (recordsIter.hasNext()) {
        HCatRecord rec = recordsIter.next();
        context.write(nw, rec);
        log.info("returned record >> " + rec.toString());
    }
}
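
One likely hot spot is the toString().split() call inside the comparators, which allocates new strings on every comparison in the shuffle's sort/merge path (this is what Yong's "rawcomparator" suggestion addresses). A minimal plain-Java sketch of the idea, comparing only the bytes before the '|' separator with no per-call allocation; the class and method names are invented for illustration, and a real Hadoop implementation would extend WritableComparator and operate on the serialized key bytes:

```java
import java.nio.charset.StandardCharsets;

public class GroupKeyCompare {
    // Compare only the bytes before the '|' separator, without building
    // substrings on every call. split() in a comparator allocates heavily
    // in the sort/merge hot path; this loop allocates nothing.
    static int compareGroupKeys(byte[] a, byte[] b) {
        int lenA = prefixLen(a), lenB = prefixLen(b);
        int n = Math.min(lenA, lenB);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return lenA - lenB; // shorter group key sorts first
    }

    // Length of the group-key prefix, i.e. the index of the first '|'.
    static int prefixLen(byte[] k) {
        for (int i = 0; i < k.length; i++) if (k[i] == '|') return i;
        return k.length;
    }

    public static void main(String[] args) {
        byte[] x = "Atlanta:GA|Atlanta:GA:1:Adeel".getBytes(StandardCharsets.UTF_8);
        byte[] y = "Atlanta:GA|Atlanta:GA:2:Sam".getBytes(StandardCharsets.UTF_8);
        byte[] z = "Austin:TX|Austin:TX:1:Bob".getBytes(StandardCharsets.UTF_8);
        System.out.println(compareGroupKeys(x, y));     // 0: same group
        System.out.println(compareGroupKeys(x, z) < 0); // true: Atlanta < Austin
    }
}
```

Whether this explains the slowdown depends on profiling; skew across the 255 reducers (as described earlier in the thread) is the other prime suspect.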


On Fri, Aug 30, 2013 at 9:24 AM, Adeel Qureshi <ad...@gmail.com>wrote:

> yup it was negative and by doing this now it seems to be working fine
>
>
> On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma <sh...@gmail.com>wrote:
>
>> Is the hash code of that key  is negative.?
>> Do something like this
>>
>> return groupKey.hashCode() & Integer.MAX_VALUE % numParts;
>>
>> Regards,
>> Som Shekhar Sharma
>> +91-8197243810
>>
>>
>> On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <ad...@gmail.com>
>> wrote:
>> > okay so when i specify the number of reducers e.g. in my example i m
>> using 4
>> > (for a much smaller data set) it works if I use a single column in my
>> > composite key .. but if I add multiple columns in the composite key
>> > separated by a delimi .. it then throws the illegal partition error
>> (keys
>> > before the pipe are group keys and after the pipe are the sort keys and
>> my
>> > partioner only uses the group keys
>> >
>> > java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel
>> > (-1)
>> >         at
>> >
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
>> >         at
>> >
>> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
>> >         at
>> >
>> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>> >         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
>> >         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
>> >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>> >         at
>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>> >         at java.security.AccessController.doPrivileged(Native Method)
>> >         at javax.security.auth.Subject.doAs(Subject.java:396)
>> >         at
>> >
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>> >         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>> >
>> >
>> > public int getPartition(Text key, HCatRecord record, int numParts) {
>> > //extract the group key from composite key
>> > String groupKey = key.toString().split("\\|")[0];
>> > return groupKey.hashCode() % numParts;
>> > }
>> >
>> >
>> > On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <sh...@gmail.com>
>> > wrote:
>> >>
>> >> No...partitionr decides which keys should go to which reducer...and
>> >> number of reducers you need to decide...No of reducers depends on
>> >> factors like number of key value pair, use case etc
>> >> Regards,
>> >> Som Shekhar Sharma
>> >> +91-8197243810
>> >>
>> >>
>> >> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <adeelmahmood@gmail.com
>> >
>> >> wrote:
>> >> > so it cant figure out an appropriate number of reducers as it does
>> for
>> >> > mappers .. in my case hadoop is using 2100+ mappers and then only 1
>> >> > reducer
>> >> > .. since im overriding the partitioner class shouldnt that decide how
>> >> > manyredeucers there should be based on how many different partition
>> >> > values
>> >> > being returned by the custom partiotioner
>> >> >
>> >> >
>> >> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com>
>> wrote:
>> >> >>
>> >> >> If you don't specify the number of Reducers, Hadoop will use the
>> >> >> default
>> >> >> -- which, unless you've changed it, is 1.
>> >> >>
>> >> >> Regards
>> >> >>
>> >> >> Ian.
>> >> >>
>> >> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >> I have implemented secondary sort in my MR job and for some reason
>> if i
>> >> >> dont specify the number of reducers it uses 1 which doesnt seems
>> right
>> >> >> because im working with 800M+ records and one reducer slows things
>> down
>> >> >> significantly. Is this some kind of limitation with the secondary
>> sort
>> >> >> that
>> >> >> it has to use a single reducer .. that kind of would defeat the
>> purpose
>> >> >> of
>> >> >> having a scalable solution such as secondary sort. I would
>> appreciate
>> >> >> any
>> >> >> help.
>> >> >>
>> >> >> Thanks
>> >> >> Adeel
>> >> >>
>> >> >>
>> >> >>
>> >> >> ---
>> >> >> Ian Wrigley
>> >> >> Sr. Curriculum Manager
>> >> >> Cloudera, Inc
>> >> >> Cell: (323) 819 4075
>> >> >>
>> >> >
>> >
>> >
>>
>
>

>> >> factors like number of key value pair, use case etc
>> >> Regards,
>> >> Som Shekhar Sharma
>> >> +91-8197243810
>> >>
>> >>
>> >> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <adeelmahmood@gmail.com
>> >
>> >> wrote:
>> >> > so it cant figure out an appropriate number of reducers as it does
>> for
>> >> > mappers .. in my case hadoop is using 2100+ mappers and then only 1
>> >> > reducer
>> >> > .. since im overriding the partitioner class shouldnt that decide how
>> >> > manyredeucers there should be based on how many different partition
>> >> > values
>> >> > being returned by the custom partiotioner
>> >> >
>> >> >
>> >> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com>
>> wrote:
>> >> >>
>> >> >> If you don't specify the number of Reducers, Hadoop will use the
>> >> >> default
>> >> >> -- which, unless you've changed it, is 1.
>> >> >>
>> >> >> Regards
>> >> >>
>> >> >> Ian.
>> >> >>
>> >> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >> I have implemented secondary sort in my MR job and for some reason
>> if i
>> >> >> dont specify the number of reducers it uses 1 which doesnt seems
>> right
>> >> >> because im working with 800M+ records and one reducer slows things
>> down
>> >> >> significantly. Is this some kind of limitation with the secondary
>> sort
>> >> >> that
>> >> >> it has to use a single reducer .. that kind of would defeat the
>> purpose
>> >> >> of
>> >> >> having a scalable solution such as secondary sort. I would
>> appreciate
>> >> >> any
>> >> >> help.
>> >> >>
>> >> >> Thanks
>> >> >> Adeel
>> >> >>
>> >> >>
>> >> >>
>> >> >> ---
>> >> >> Ian Wrigley
>> >> >> Sr. Curriculum Manager
>> >> >> Cloudera, Inc
>> >> >> Cell: (323) 819 4075
>> >> >>
>> >> >
>> >
>> >
>>
>
>

Re: secondary sort - number of reducers

Posted by Adeel Qureshi <ad...@gmail.com>.
Yup, it was negative; with this change it now seems to be working fine.
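
(For reference, the other half of the original question, the job running with a single reducer, comes down to driver configuration. A minimal driver sketch against the Hadoop 1.x-era API; all class names other than HSMapper, which appears in the stack trace, are illustrative placeholders, not the actual classes from this job:)

```java
// Illustrative driver sketch: secondary-sort wiring plus an explicit
// reducer count. HSReducer, GroupPartitioner, CompositeKeyComparator and
// GroupComparator are placeholder names.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class HiveSortDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "secondary-sort");
        job.setJarByClass(HiveSortDriver.class);

        job.setMapperClass(HSMapper.class);
        job.setReducerClass(HSReducer.class);
        job.setPartitionerClass(GroupPartitioner.class);
        job.setSortComparatorClass(CompositeKeyComparator.class);
        job.setGroupingComparatorClass(GroupComparator.class);

        // Without this line the job falls back to the default of 1 reducer.
        job.setNumReduceTasks(4);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```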


On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma <sh...@gmail.com>wrote:

> Is the hash code of that key  is negative.?
> Do something like this
>
> return groupKey.hashCode() & Integer.MAX_VALUE % numParts;
>
> Regards,
> Som Shekhar Sharma
> +91-8197243810
>
>
> On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <ad...@gmail.com>
> wrote:
> > okay so when i specify the number of reducers e.g. in my example i m
> using 4
> > (for a much smaller data set) it works if I use a single column in my
> > composite key .. but if I add multiple columns in the composite key
> > separated by a delimi .. it then throws the illegal partition error (keys
> > before the pipe are group keys and after the pipe are the sort keys and
> my
> > partioner only uses the group keys
> >
> > java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel
> > (-1)
> >         at
> >
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
> >         at
> >
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
> >         at
> >
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
> >         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
> >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >         at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:396)
> >         at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> >         at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >
> >
> > public int getPartition(Text key, HCatRecord record, int numParts) {
> > //extract the group key from composite key
> > String groupKey = key.toString().split("\\|")[0];
> > return groupKey.hashCode() % numParts;
> > }
> >
> >
> > On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <sh...@gmail.com>
> > wrote:
> >>
> >> No...partitionr decides which keys should go to which reducer...and
> >> number of reducers you need to decide...No of reducers depends on
> >> factors like number of key value pair, use case etc
> >> Regards,
> >> Som Shekhar Sharma
> >> +91-8197243810
> >>
> >>
> >> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com>
> >> wrote:
> >> > so it cant figure out an appropriate number of reducers as it does for
> >> > mappers .. in my case hadoop is using 2100+ mappers and then only 1
> >> > reducer
> >> > .. since im overriding the partitioner class shouldnt that decide how
> >> > manyredeucers there should be based on how many different partition
> >> > values
> >> > being returned by the custom partiotioner
> >> >
> >> >
> >> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com>
> wrote:
> >> >>
> >> >> If you don't specify the number of Reducers, Hadoop will use the
> >> >> default
> >> >> -- which, unless you've changed it, is 1.
> >> >>
> >> >> Regards
> >> >>
> >> >> Ian.
> >> >>
> >> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com>
> >> >> wrote:
> >> >>
> >> >> I have implemented secondary sort in my MR job and for some reason
> if i
> >> >> dont specify the number of reducers it uses 1 which doesnt seems
> right
> >> >> because im working with 800M+ records and one reducer slows things
> down
> >> >> significantly. Is this some kind of limitation with the secondary
> sort
> >> >> that
> >> >> it has to use a single reducer .. that kind of would defeat the
> purpose
> >> >> of
> >> >> having a scalable solution such as secondary sort. I would appreciate
> >> >> any
> >> >> help.
> >> >>
> >> >> Thanks
> >> >> Adeel
> >> >>
> >> >>
> >> >>
> >> >> ---
> >> >> Ian Wrigley
> >> >> Sr. Curriculum Manager
> >> >> Cloudera, Inc
> >> >> Cell: (323) 819 4075
> >> >>
> >> >
> >
> >
>

Re: secondary sort - number of reducers

Posted by Shekhar Sharma <sh...@gmail.com>.
Is the hash code of that key negative? Try something like this:

return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;

Regards,
Som Shekhar Sharma
+91-8197243810


On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <ad...@gmail.com> wrote:
> okay so when i specify the number of reducers e.g. in my example i m using 4
> (for a much smaller data set) it works if I use a single column in my
> composite key .. but if I add multiple columns in the composite key
> separated by a delimi .. it then throws the illegal partition error (keys
> before the pipe are group keys and after the pipe are the sort keys and my
> partioner only uses the group keys
>
> java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel
> (-1)
>         at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
>         at
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
>         at
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
>         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
>
> public int getPartition(Text key, HCatRecord record, int numParts) {
> //extract the group key from composite key
> String groupKey = key.toString().split("\\|")[0];
> return groupKey.hashCode() % numParts;
> }
>
>
> On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <sh...@gmail.com>
> wrote:
>>
>> No...partitionr decides which keys should go to which reducer...and
>> number of reducers you need to decide...No of reducers depends on
>> factors like number of key value pair, use case etc
>> Regards,
>> Som Shekhar Sharma
>> +91-8197243810
>>
>>
>> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com>
>> wrote:
>> > so it cant figure out an appropriate number of reducers as it does for
>> > mappers .. in my case hadoop is using 2100+ mappers and then only 1
>> > reducer
>> > .. since im overriding the partitioner class shouldnt that decide how
>> > manyredeucers there should be based on how many different partition
>> > values
>> > being returned by the custom partiotioner
>> >
>> >
>> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com> wrote:
>> >>
>> >> If you don't specify the number of Reducers, Hadoop will use the
>> >> default
>> >> -- which, unless you've changed it, is 1.
>> >>
>> >> Regards
>> >>
>> >> Ian.
>> >>
>> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com>
>> >> wrote:
>> >>
>> >> I have implemented secondary sort in my MR job and for some reason if i
>> >> dont specify the number of reducers it uses 1 which doesnt seems right
>> >> because im working with 800M+ records and one reducer slows things down
>> >> significantly. Is this some kind of limitation with the secondary sort
>> >> that
>> >> it has to use a single reducer .. that kind of would defeat the purpose
>> >> of
>> >> having a scalable solution such as secondary sort. I would appreciate
>> >> any
>> >> help.
>> >>
>> >> Thanks
>> >> Adeel
>> >>
>> >>
>> >>
>> >> ---
>> >> Ian Wrigley
>> >> Sr. Curriculum Manager
>> >> Cloudera, Inc
>> >> Cell: (323) 819 4075
>> >>
>> >
>
>

RE: secondary sort - number of reducers

Posted by java8964 java8964 <ja...@hotmail.com>.
The getPartition() method needs to return a non-negative number; simply calling hashCode() is not enough.
See the Hadoop HashPartitioner implementation:

return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
When I first read this code, I wondered why it doesn't use Math.abs. Is (& Integer.MAX_VALUE) faster?
Yong
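
Two details about the HashPartitioner snippet above are easy to verify in standalone Java: Math.abs is actually unsafe here because Math.abs(Integer.MIN_VALUE) overflows and stays negative (likely the real reason for the mask), and the parentheses are required because % binds more tightly than & in Java:

```java
// Why HashPartitioner uses (& Integer.MAX_VALUE) rather than Math.abs,
// and why the parentheses around the mask are required.
public class MaskDemo {
    public static void main(String[] args) {
        // Math.abs overflows on Integer.MIN_VALUE: the result is still
        // negative, so Math.abs(hash) % n can still yield a negative partition.
        System.out.println(Math.abs(Integer.MIN_VALUE)); // prints -2147483648

        // In Java, % binds more tightly than &, so without parentheses the
        // expression parses as h & (Integer.MAX_VALUE % n) -- not a valid
        // partition computation at all.
        int h = -3, n = 3;
        int unparenthesized = h & Integer.MAX_VALUE % n; // h & 1: only ever 0 or 1
        int masked = (h & Integer.MAX_VALUE) % n;        // always in [0, n)
        System.out.println(unparenthesized + " vs " + masked); // prints "1 vs 2"
    }
}
```

With n = 3 the unparenthesized form can never produce partition 2, so one reducer would silently receive no data even when the result happens to be non-negative.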
Date: Thu, 29 Aug 2013 20:55:46 -0400
Subject: Re: secondary sort - number of reducers
From: adeelmahmood@gmail.com
To: user@hadoop.apache.org

okay so when i specify the number of reducers e.g. in my example i m using 4 (for a much smaller data set) it works if I use a single column in my composite key .. but if I add multiple columns in the composite key separated by a delimi .. it then throws the illegal partition error (keys before the pipe are group keys and after the pipe are the sort keys and my partioner only uses the group keys

java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel (-1)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
        at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)


public int getPartition(Text key, HCatRecord record, int numParts) {
        //extract the group key from composite key
        String groupKey = key.toString().split("\\|")[0];
        return groupKey.hashCode() % numParts;
}

On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <sh...@gmail.com> wrote:

No...partitionr decides which keys should go to which reducer...and
number of reducers you need to decide...No of reducers depends on
factors like number of key value pair, use case etc
Regards,
Som Shekhar Sharma
+91-8197243810


On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com> wrote:
> so it cant figure out an appropriate number of reducers as it does for
> mappers .. in my case hadoop is using 2100+ mappers and then only 1 reducer
> .. since im overriding the partitioner class shouldnt that decide how
> manyredeucers there should be based on how many different partition values
> being returned by the custom partiotioner
>
>
> On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com> wrote:
>>
>> If you don't specify the number of Reducers, Hadoop will use the default
>> -- which, unless you've changed it, is 1.
>>
>> Regards
>>
>> Ian.
>>
>> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com> wrote:
>>
>> I have implemented secondary sort in my MR job and for some reason if i
>> dont specify the number of reducers it uses 1 which doesnt seems right
>> because im working with 800M+ records and one reducer slows things down
>> significantly. Is this some kind of limitation with the secondary sort that
>> it has to use a single reducer .. that kind of would defeat the purpose of
>> having a scalable solution such as secondary sort. I would appreciate any
>> help.
>>
>> Thanks
>> Adeel
>>
>>
>>
>> ---
>> Ian Wrigley
>> Sr. Curriculum Manager
>> Cloudera, Inc
>> Cell: (323) 819 4075
>>
>



Re: secondary sort - number of reducers

Posted by Shekhar Sharma <sh...@gmail.com>.
Is the hash code of that key  is negative.?
Do something like this

return groupKey.hashCode() & Integer.MAX_VALUE % numParts;

Regards,
Som Shekhar Sharma
+91-8197243810


On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <ad...@gmail.com> wrote:
> okay so when i specify the number of reducers e.g. in my example i m using 4
> (for a much smaller data set) it works if I use a single column in my
> composite key .. but if I add multiple columns in the composite key
> separated by a delimi .. it then throws the illegal partition error (keys
> before the pipe are group keys and after the pipe are the sort keys and my
> partioner only uses the group keys
>
> java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel
> (-1)
>         at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
>         at
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
>         at
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
>         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
>
> public int getPartition(Text key, HCatRecord record, int numParts) {
> //extract the group key from composite key
> String groupKey = key.toString().split("\\|")[0];
> return groupKey.hashCode() % numParts;
> }
>
>
> On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <sh...@gmail.com>
> wrote:
>>
>> No...partitionr decides which keys should go to which reducer...and
>> number of reducers you need to decide...No of reducers depends on
>> factors like number of key value pair, use case etc
>> Regards,
>> Som Shekhar Sharma
>> +91-8197243810
>>
>>
>> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com>
>> wrote:
>> > so it cant figure out an appropriate number of reducers as it does for
>> > mappers .. in my case hadoop is using 2100+ mappers and then only 1
>> > reducer
>> > .. since im overriding the partitioner class shouldnt that decide how
>> > manyredeucers there should be based on how many different partition
>> > values
>> > being returned by the custom partiotioner
>> >
>> >
>> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com> wrote:
>> >>
>> >> If you don't specify the number of Reducers, Hadoop will use the
>> >> default
>> >> -- which, unless you've changed it, is 1.
>> >>
>> >> Regards
>> >>
>> >> Ian.
>> >>
>> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com>
>> >> wrote:
>> >>
>> >> I have implemented secondary sort in my MR job and for some reason if i
>> >> dont specify the number of reducers it uses 1 which doesnt seems right
>> >> because im working with 800M+ records and one reducer slows things down
>> >> significantly. Is this some kind of limitation with the secondary sort
>> >> that
>> >> it has to use a single reducer .. that kind of would defeat the purpose
>> >> of
>> >> having a scalable solution such as secondary sort. I would appreciate
>> >> any
>> >> help.
>> >>
>> >> Thanks
>> >> Adeel
>> >>
>> >>
>> >>
>> >> ---
>> >> Ian Wrigley
>> >> Sr. Curriculum Manager
>> >> Cloudera, Inc
>> >> Cell: (323) 819 4075
>> >>
>> >
>
>


RE: secondary sort - number of reducers

Posted by java8964 java8964 <ja...@hotmail.com>.
The method getPartition() needs to return a non-negative number; simply using the hashCode() method is not enough.
See the Hadoop HashPartitioner implementation:

    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
When I first read this code, I always wondered why it doesn't just use Math.abs(). Is (& Integer.MAX_VALUE) faster?
Yong
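
As an aside from the editor, the point above can be demonstrated with a small standalone sketch (not part of the original thread): a raw hashCode() can produce an illegal negative partition, and masking with Integer.MAX_VALUE is preferred over Math.abs() because Math.abs() overflows for Integer.MIN_VALUE.

```java
// Illustrative demo: why a raw hashCode() can produce an illegal (negative)
// partition, and why masking with Integer.MAX_VALUE beats Math.abs().
public class PartitionDemo {
    public static void main(String[] args) {
        int numReduceTasks = 4;

        // Suppose a key's hashCode() came back negative, e.g. -1.
        int h = -1;
        // Java's % keeps the sign of the dividend, so this is an
        // illegal partition number.
        System.out.println(h % numReduceTasks); // prints -1

        // Clearing the sign bit keeps the result in [0, numReduceTasks).
        System.out.println((h & Integer.MAX_VALUE) % numReduceTasks); // prints 3

        // Math.abs() is not a safe alternative: it overflows for MIN_VALUE.
        System.out.println(Math.abs(Integer.MIN_VALUE)); // prints -2147483648
    }
}
```

So the mask is not only cheap (a single AND instruction) but also correct for the one input where Math.abs() silently fails.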
Date: Thu, 29 Aug 2013 20:55:46 -0400
Subject: Re: secondary sort - number of reducers
From: adeelmahmood@gmail.com
To: user@hadoop.apache.org

Okay, so when I specify the number of reducers (e.g. in my example I'm using 4, for a much smaller data set) it works if I use a single column in my composite key. But if I add multiple columns to the composite key, separated by a delimiter, it then throws the illegal-partition error (keys before the pipe are group keys, keys after the pipe are the sort keys, and my partitioner only uses the group keys):

java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel (-1)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
        at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)


public int getPartition(Text key, HCatRecord record, int numParts) {
    // extract the group key from the composite key
    String groupKey = key.toString().split("\\|")[0];
    return groupKey.hashCode() % numParts;
}

On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <sh...@gmail.com> wrote:

No...partitionr decides which keys should go to which reducer...and


number of reducers you need to decide...No of reducers depends on

factors like number of key value pair, use case etc

Regards,

Som Shekhar Sharma

+91-8197243810





On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com> wrote:

> so it cant figure out an appropriate number of reducers as it does for

> mappers .. in my case hadoop is using 2100+ mappers and then only 1 reducer

> .. since im overriding the partitioner class shouldnt that decide how

> manyredeucers there should be based on how many different partition values

> being returned by the custom partiotioner

>

>

> On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com> wrote:

>>

>> If you don't specify the number of Reducers, Hadoop will use the default

>> -- which, unless you've changed it, is 1.

>>

>> Regards

>>

>> Ian.

>>

>> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com> wrote:

>>

>> I have implemented secondary sort in my MR job and for some reason if i

>> dont specify the number of reducers it uses 1 which doesnt seems right

>> because im working with 800M+ records and one reducer slows things down

>> significantly. Is this some kind of limitation with the secondary sort that

>> it has to use a single reducer .. that kind of would defeat the purpose of

>> having a scalable solution such as secondary sort. I would appreciate any

>> help.

>>

>> Thanks

>> Adeel

>>

>>

>>

>> ---

>> Ian Wrigley

>> Sr. Curriculum Manager

>> Cloudera, Inc

>> Cell: (323) 819 4075

>>

>


 		 	   		  


Re: secondary sort - number of reducers

Posted by Adeel Qureshi <ad...@gmail.com>.
Okay, so when I specify the number of reducers (e.g. in my example I'm using
4, for a much smaller data set) it works if I use a single column in my
composite key. But if I add multiple columns to the composite key, separated
by a delimiter, it then throws the illegal-partition error (keys before the
pipe are group keys, keys after the pipe are the sort keys, and my
partitioner only uses the group keys):

java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel (-1)
        at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
        at
org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
        at
org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
        at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)


public int getPartition(Text key, HCatRecord record, int numParts) {
    // extract the group key from the composite key
    String groupKey = key.toString().split("\\|")[0];
    return groupKey.hashCode() % numParts;
}
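
Editor's note: the bug here is that hashCode() may be negative, so getPartition() can return an illegal partition number. A hedged sketch of a corrected partitioner follows. It is simplified to plain Java strings so it runs without Hadoop on the classpath; the real class would keep the Text/HCatRecord signature shown above.

```java
// Simplified, standalone version of the partitioner above, with the
// sign bit masked off so the partition number is never negative.
public class GroupKeyPartitioner {
    public static int getPartition(String compositeKey, int numParts) {
        // extract the group key (the part before the pipe) from the composite key
        String groupKey = compositeKey.split("\\|")[0];
        // clearing the sign bit keeps the result in [0, numParts)
        return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    public static void main(String[] args) {
        int p = getPartition("Atlanta:GA|Atlanta:GA:1:Adeel", 4);
        System.out.println(p >= 0 && p < 4); // prints true
    }
}
```

Because only the group key feeds the hash, all records with the same group key still land on the same reducer regardless of their sort-key suffix, which is what secondary sort requires.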


On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <sh...@gmail.com>wrote:

> No...partitionr decides which keys should go to which reducer...and
> number of reducers you need to decide...No of reducers depends on
> factors like number of key value pair, use case etc
> Regards,
> Som Shekhar Sharma
> +91-8197243810
>
>
> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com>
> wrote:
> > so it cant figure out an appropriate number of reducers as it does for
> > mappers .. in my case hadoop is using 2100+ mappers and then only 1
> reducer
> > .. since im overriding the partitioner class shouldnt that decide how
> > manyredeucers there should be based on how many different partition
> values
> > being returned by the custom partiotioner
> >
> >
> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com> wrote:
> >>
> >> If you don't specify the number of Reducers, Hadoop will use the default
> >> -- which, unless you've changed it, is 1.
> >>
> >> Regards
> >>
> >> Ian.
> >>
> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com>
> wrote:
> >>
> >> I have implemented secondary sort in my MR job and for some reason if i
> >> dont specify the number of reducers it uses 1 which doesnt seems right
> >> because im working with 800M+ records and one reducer slows things down
> >> significantly. Is this some kind of limitation with the secondary sort
> that
> >> it has to use a single reducer .. that kind of would defeat the purpose
> of
> >> having a scalable solution such as secondary sort. I would appreciate
> any
> >> help.
> >>
> >> Thanks
> >> Adeel
> >>
> >>
> >>
> >> ---
> >> Ian Wrigley
> >> Sr. Curriculum Manager
> >> Cloudera, Inc
> >> Cell: (323) 819 4075
> >>
> >
>


Re: secondary sort - number of reducers

Posted by Shekhar Sharma <sh...@gmail.com>.
No... the partitioner decides which keys go to which reducer; the number of
reducers is something you need to decide yourself. It depends on factors
like the number of key-value pairs, the use case, etc.
Regards,
Som Shekhar Sharma
+91-8197243810


On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com> wrote:
> so it cant figure out an appropriate number of reducers as it does for
> mappers .. in my case hadoop is using 2100+ mappers and then only 1 reducer
> .. since im overriding the partitioner class shouldnt that decide how
> manyredeucers there should be based on how many different partition values
> being returned by the custom partiotioner
>
>
> On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com> wrote:
>>
>> If you don't specify the number of Reducers, Hadoop will use the default
>> -- which, unless you've changed it, is 1.
>>
>> Regards
>>
>> Ian.
>>
>> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com> wrote:
>>
>> I have implemented secondary sort in my MR job and for some reason if i
>> dont specify the number of reducers it uses 1 which doesnt seems right
>> because im working with 800M+ records and one reducer slows things down
>> significantly. Is this some kind of limitation with the secondary sort that
>> it has to use a single reducer .. that kind of would defeat the purpose of
>> having a scalable solution such as secondary sort. I would appreciate any
>> help.
>>
>> Thanks
>> Adeel
>>
>>
>>
>> ---
>> Ian Wrigley
>> Sr. Curriculum Manager
>> Cloudera, Inc
>> Cell: (323) 819 4075
>>
>


Re: secondary sort - number of reducers

Posted by Shekhar Sharma <sh...@gmail.com>.
No...partitionr decides which keys should go to which reducer...and
number of reducers you need to decide...No of reducers depends on
factors like number of key value pair, use case etc
Regards,
Som Shekhar Sharma
+91-8197243810


On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com> wrote:
> so it cant figure out an appropriate number of reducers as it does for
> mappers .. in my case hadoop is using 2100+ mappers and then only 1 reducer
> .. since im overriding the partitioner class shouldnt that decide how
> manyredeucers there should be based on how many different partition values
> being returned by the custom partiotioner
>
>
> On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com> wrote:
>>
>> If you don't specify the number of Reducers, Hadoop will use the default
>> -- which, unless you've changed it, is 1.
>>
>> Regards
>>
>> Ian.
>>
>> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com> wrote:
>>
>> I have implemented secondary sort in my MR job and for some reason if i
>> dont specify the number of reducers it uses 1 which doesnt seems right
>> because im working with 800M+ records and one reducer slows things down
>> significantly. Is this some kind of limitation with the secondary sort that
>> it has to use a single reducer .. that kind of would defeat the purpose of
>> having a scalable solution such as secondary sort. I would appreciate any
>> help.
>>
>> Thanks
>> Adeel
>>
>>
>>
>> ---
>> Ian Wrigley
>> Sr. Curriculum Manager
>> Cloudera, Inc
>> Cell: (323) 819 4075
>>
>

Re: secondary sort - number of reducers

Posted by Shekhar Sharma <sh...@gmail.com>.
No...partitionr decides which keys should go to which reducer...and
number of reducers you need to decide...No of reducers depends on
factors like number of key value pair, use case etc
Regards,
Som Shekhar Sharma
+91-8197243810


On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <ad...@gmail.com> wrote:
> so it cant figure out an appropriate number of reducers as it does for
> mappers .. in my case hadoop is using 2100+ mappers and then only 1 reducer
> .. since im overriding the partitioner class shouldnt that decide how
> manyredeucers there should be based on how many different partition values
> being returned by the custom partiotioner
>
>
> On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com> wrote:
>>
>> If you don't specify the number of Reducers, Hadoop will use the default
>> -- which, unless you've changed it, is 1.
>>
>> Regards
>>
>> Ian.
>>
>> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com> wrote:
>>
>> I have implemented secondary sort in my MR job and for some reason if i
>> dont specify the number of reducers it uses 1 which doesnt seems right
>> because im working with 800M+ records and one reducer slows things down
>> significantly. Is this some kind of limitation with the secondary sort that
>> it has to use a single reducer .. that kind of would defeat the purpose of
>> having a scalable solution such as secondary sort. I would appreciate any
>> help.
>>
>> Thanks
>> Adeel
>>
>>
>>
>> ---
>> Ian Wrigley
>> Sr. Curriculum Manager
>> Cloudera, Inc
>> Cell: (323) 819 4075
>>
>

Re: secondary sort - number of reducers

Posted by Adeel Qureshi <ad...@gmail.com>.
So it can't figure out an appropriate number of reducers as it does for
mappers? In my case Hadoop is using 2100+ mappers and then only 1 reducer.
Since I'm overriding the partitioner class, shouldn't that decide how
many reducers there should be, based on how many different partition values
are returned by the custom partitioner?


On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <ia...@cloudera.com> wrote:

> If you don't specify the number of Reducers, Hadoop will use the default
> -- which, unless you've changed it, is 1.
>
> Regards
>
> Ian.
>
> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com> wrote:
>
> I have implemented secondary sort in my MR job, and for some reason if I
> don't specify the number of reducers it uses 1, which doesn't seem right,
> because I'm working with 800M+ records and one reducer slows things down
> significantly. Is this some kind of limitation of secondary sort, that it
> has to use a single reducer? That would rather defeat the purpose of
> having a scalable solution such as secondary sort. I would appreciate any
> help.
>
> Thanks
> Adeel
>
>
>
> ---
> Ian Wrigley
> Sr. Curriculum Manager
> Cloudera, Inc
> Cell: (323) 819 4075
>
>
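
[Archive note] As the replies explain, the partitioner only routes each key into the range [0, numReduceTasks); the reducer count itself has to be set on the job and is never derived from the partitioner. A driver-side configuration sketch using the newer MapReduce API (the driver and comparator class names are illustrative placeholders, not from the original job):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SecondarySortDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "secondary sort");
        job.setJarByClass(SecondarySortDriver.class);

        // Explicit reducer count; without this the default (1) is used.
        job.setNumReduceTasks(32);

        // Typical secondary-sort plumbing: partition and group on the
        // natural key, sort on the full composite key. Class names below
        // are placeholders for the job's own implementations.
        // job.setPartitionerClass(NaturalKeyPartitioner.class);
        // job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
        // job.setSortComparatorClass(CompositeKeyComparator.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With 800M+ records, a reasonable starting point is a reducer count near the cluster's total reduce-slot capacity, tuned from there.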

Re: secondary sort - number of reducers

Posted by Ian Wrigley <ia...@cloudera.com>.
If you don't specify the number of Reducers, Hadoop will use the default -- which, unless you've changed it, is 1.

Regards

Ian.

On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <ad...@gmail.com> wrote:

> I have implemented secondary sort in my MR job, and for some reason if I don't specify the number of reducers it uses 1, which doesn't seem right, because I'm working with 800M+ records and one reducer slows things down significantly. Is this some kind of limitation of secondary sort, that it has to use a single reducer? That would rather defeat the purpose of having a scalable solution such as secondary sort. I would appreciate any help.
> 
> Thanks
> Adeel


---
Ian Wrigley
Sr. Curriculum Manager
Cloudera, Inc
Cell: (323) 819 4075
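
[Archive note] The default of 1 comes from the job configuration (mapreduce.job.reduces in Hadoop 2.x, mapred.reduce.tasks in 1.x), so it can also be overridden per run from the command line, provided the driver parses generic options via ToolRunner/GenericOptionsParser. The jar, class, and paths below are placeholders; only the -D property is the actual knob:

```shell
# Override the reducer count for a single run (placeholder jar/class/paths).
hadoop jar myjob.jar com.example.SecondarySortDriver \
    -D mapreduce.job.reduces=32 \
    /input/path /output/path
```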

