Posted to mapreduce-user@hadoop.apache.org by Ahmed Abdeen Hamed <ah...@gmail.com> on 2012/01/23 21:29:51 UTC

distributing a time consuming single reduce task

Hello friends,

I wrote a reduce() that receives a large dataset as text values from the
map(). The purpose of the reduce() is to compute the distance between each
pair of items in the values. When I do this, I run out of memory. I tried to
increase the heap size, but that didn't scale either. I am wondering if
there is a way I can distribute the reduce() so it scales. If
this is possible, can you kindly share your idea?
Please note that it is crucial for the values to be passed together in the
fashion that I am doing, so they can be clustered into groups.

Here is what the reduce() looks like:



public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {
    Text key = new Text("1");

    Set<String> inputSet = new HashSet<String>();
    StringBuilder clusterBuilder = new StringBuilder();
    Set<Set<String>> clClustering = null;
    Text group = new Text();

    // Complete-Link Clusterer
    HierarchicalClusterer<String> clClusterer =
            new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);
    String[] brandsList = null;

    public void reduce(Text productID, Iterable<Text> brandNames, Context context)
            throws IOException, InterruptedException {
        for (Text brand : brandNames) {
            inputSet.add(brand.toString());
        }
        // perform clustering on the inputSet
        clClustering = clClusterer.cluster(inputSet);

        Iterator<Set<String>> itr = clClustering.iterator();
        while (itr.hasNext()) {
            Set<String> brandsSet = itr.next();
            clusterBuilder.append("[");
            for (String aBrand : brandsSet) {
                clusterBuilder.append(aBrand + ",");
            }
            clusterBuilder.append("]");
        }
        group.set(clusterBuilder.toString());
        clusterBuilder = new StringBuilder();
        context.write(key, group);
    }
}



Thanks,
-Ahmed

Re: distributing a time consuming single reduce task

Posted by Ahmed Abdeen Hamed <ah...@gmail.com>.
Thanks very much Steve!

The clustering part of the code is really a black box and there isn't much
to do as far as restructuring. I ended up breaking the big input file into
smaller ones and I am letting it run on the cluster. I will know in the
morning whether it succeeded or not. But I will consider using Mahout for
clustering, since it is built on top of MapReduce. I will let you know how
that goes if you are interested.

Thanks very much once again for your kind responses!
-Ahmed


On Mon, Jan 23, 2012 at 9:09 PM, Steve Lewis <lo...@gmail.com> wrote:

> It sounds like the HierarchicalClusterer, whatever that is, is doing what
> a collection of reducers should be doing - try to restructure the job so
> that the clustering is done more in the sort step, allowing the reducer to
> simply collect clusters - the cluster method needs to be
> rearchitected to lean more heavily on map-reduce.
>

Re: distributing a time consuming single reduce task

Posted by Steve Lewis <lo...@gmail.com>.
It sounds like the HierarchicalClusterer, whatever that is, is doing what
a collection of reducers should be doing - try to restructure the job so
that the clustering is done more in the sort step, allowing the reducer to
simply collect clusters - the cluster method needs to be
rearchitected to lean more heavily on map-reduce.
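
A rough sketch of that restructuring (everything here is hypothetical: it
assumes the brand names can be given a cheap blocking key in the map step,
such as a normalized prefix, so the shuffle groups likely cluster-mates
under one key and the reducer only has to collect them; the class names and
the blocking function are invented for illustration):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BrandBlockingJob {

    // Map step: emit (blockingKey, brandName). The blocking function is a
    // stand-in; anything that keeps true neighbours under the same key works.
    public static class BrandBlockMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text blockKey = new Text();
        private final Text brand = new Text();

        public void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String name = line.toString().trim();
            if (name.isEmpty()) {
                return;
            }
            // hypothetical blocking key: first three characters, lower-cased
            String block = name.toLowerCase().substring(0, Math.min(3, name.length()));
            blockKey.set(block);
            brand.set(name);
            context.write(blockKey, brand);
        }
    }

    // Reduce step: the shuffle has already grouped likely cluster-mates,
    // so the reducer only collects and formats each (much smaller) group.
    public static class BrandCollectingReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text block, Iterable<Text> brands, Context context)
                throws IOException, InterruptedException {
            StringBuilder cluster = new StringBuilder("[");
            for (Text b : brands) {
                cluster.append(b.toString()).append(",");
            }
            cluster.append("]");
            context.write(block, new Text(cluster.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "brand blocking");
        job.setJarByClass(BrandBlockingJob.class);
        job.setMapperClass(BrandBlockMapper.class);
        job.setReducerClass(BrandCollectingReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The obvious caveat is that brands landing in different blocks can never be
compared, so the blocking function has to be chosen generously; a real
complete-link pass could still run inside each reduce() call, now over a
much smaller group.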



On Mon, Jan 23, 2012 at 12:57 PM, Ahmed Abdeen Hamed <
ahmed.elmasri@gmail.com> wrote:

> Thanks very much for the valuable tips! I made the changes that you
> pointed out. I am unclear on how to handle that many items all at once without
> putting them all in memory. I can split the file into a few files, which
> could be helpful, but I could also end up splitting a group across two different
> files. To answer your question about how many elements I have in memory,
> there are 871671 items.
>
> Below is what the reduce() looks like after I followed your suggestions;
> it still ran out of memory. I would kindly appreciate a few more tips
> before I try splitting the files. It feels like that is against the
> spirit of Hadoop.
>
> public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {
>     // Complete-Link Clusterer
>     HierarchicalClusterer<String> clClusterer =
>             new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);
>
>     public void reduce(Text productID, Iterable<Text> brandNames, Context context)
>             throws IOException, InterruptedException {
>         Text key = new Text("1");
>         Set<Set<String>> clClustering = null;
>         Text group = new Text();
>         Set<String> inputSet = new HashSet<String>();
>         StringBuilder clusterBuilder = new StringBuilder();
>         for (Text brand : brandNames) {
>             inputSet.add(brand.toString());
>         }
>         // perform clustering on the inputSet
>         clClustering = clClusterer.cluster(inputSet);
>
>         Iterator<Set<String>> itr = clClustering.iterator();
>         while (itr.hasNext()) {
>             Set<String> brandsSet = itr.next();
>             clusterBuilder.append("[");
>             for (String aBrand : brandsSet) {
>                 clusterBuilder.append(aBrand + ",");
>             }
>             clusterBuilder.append("]");
>         }
>         group.set(clusterBuilder.toString());
>         clusterBuilder = new StringBuilder();
>         context.write(key, group);
>         inputSet = null;
>         clusterBuilder = null;
>     }
> }
>
>
>
>
>
>
> On Mon, Jan 23, 2012 at 3:41 PM, Steve Lewis <lo...@gmail.com> wrote:
>
>> In general, keeping the values you iterate through in memory in the
>> inputSet is a bad idea.
>> How many items do you have, and how large is inputSet when you finish?
>> You should make inputSet a local variable in the reduce method, since you
>> are not using its contents later.
>> Also, with the published code that set will expand forever, since you do
>> not clear it after the reduce method, and that will surely run you out of
>> memory.
>>
>


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: distributing a time consuming single reduce task

Posted by Ahmed Abdeen Hamed <ah...@gmail.com>.
Thanks very much for the valuable tips! I made the changes that you
pointed out. I am unclear on how to handle that many items all at once without
putting them all in memory. I can split the file into a few files, which
could be helpful, but I could also end up splitting a group across two different
files. To answer your question about how many elements I have in memory,
there are 871671 items.

Below is what the reduce() looks like after I followed your suggestions;
it still ran out of memory. I would kindly appreciate a few more tips
before I try splitting the files. It feels like that is against the
spirit of Hadoop.

public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {
    // Complete-Link Clusterer
    HierarchicalClusterer<String> clClusterer =
            new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);

    public void reduce(Text productID, Iterable<Text> brandNames, Context context)
            throws IOException, InterruptedException {
        Text key = new Text("1");
        Set<Set<String>> clClustering = null;
        Text group = new Text();
        Set<String> inputSet = new HashSet<String>();
        StringBuilder clusterBuilder = new StringBuilder();
        for (Text brand : brandNames) {
            inputSet.add(brand.toString());
        }
        // perform clustering on the inputSet
        clClustering = clClusterer.cluster(inputSet);

        Iterator<Set<String>> itr = clClustering.iterator();
        while (itr.hasNext()) {
            Set<String> brandsSet = itr.next();
            clusterBuilder.append("[");
            for (String aBrand : brandsSet) {
                clusterBuilder.append(aBrand + ",");
            }
            clusterBuilder.append("]");
        }
        group.set(clusterBuilder.toString());
        clusterBuilder = new StringBuilder();
        context.write(key, group);
        inputSet = null;
        clusterBuilder = null;
    }
}






On Mon, Jan 23, 2012 at 3:41 PM, Steve Lewis <lo...@gmail.com> wrote:

> In general, keeping the values you iterate through in memory in the
> inputSet is a bad idea.
> How many items do you have, and how large is inputSet when you finish?
> You should make inputSet a local variable in the reduce method, since you
> are not using its contents later.
> Also, with the published code that set will expand forever, since you do
> not clear it after the reduce method, and that will surely run you out of
> memory.
>

Re: distributing a time consuming single reduce task

Posted by Steve Lewis <lo...@gmail.com>.
In general, keeping the values you iterate through in memory in the
inputSet is a bad idea.
How many items do you have, and how large is inputSet when you finish?
You should make inputSet a local variable in the reduce method, since you
are not using its contents later.
Also, with the published code that set will expand forever, since you do
not clear it after the reduce method, and that will surely run you out of
memory.
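
A minimal sketch of what that advice looks like in code (this assumes the
same surroundings as the posted class - the clusterer, the MAX_DISTANCE and
EDIT_DISTANCE constants, and the imports - and, as an extra assumption
beyond the advice above, it writes each cluster as its own record keyed by
productID instead of concatenating everything into one value):

public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {
    // kept as a field on the assumption that the clusterer holds no per-call state
    private final HierarchicalClusterer<String> clClusterer =
            new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);

    public void reduce(Text productID, Iterable<Text> brandNames, Context context)
            throws IOException, InterruptedException {
        // local to this call, so it can be garbage-collected when the call returns
        Set<String> inputSet = new HashSet<String>();
        for (Text brand : brandNames) {
            inputSet.add(brand.toString());
        }

        Set<Set<String>> clustering = clClusterer.cluster(inputSet);

        // format and emit each cluster as it is produced instead of
        // accumulating every cluster into one large StringBuilder
        Text group = new Text();
        for (Set<String> brandsSet : clustering) {
            StringBuilder clusterBuilder = new StringBuilder("[");
            for (String aBrand : brandsSet) {
                clusterBuilder.append(aBrand).append(",");
            }
            clusterBuilder.append("]");
            group.set(clusterBuilder.toString());
            context.write(productID, group);
        }
    }
}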

On Mon, Jan 23, 2012 at 12:29 PM, Ahmed Abdeen Hamed <
ahmed.elmasri@gmail.com> wrote:

> Hello friends,
>
> I wrote a reduce() that receives a large dataset as text values from the
> map(). The purpose of the reduce() is to compute the distance between each
> pair of items in the values. When I do this, I run out of memory. I tried to
> increase the heap size, but that didn't scale either. I am wondering if
> there is a way I can distribute the reduce() so it scales. If
> this is possible, can you kindly share your idea?
> Please note that it is crucial for the values to be passed together in the
> fashion that I am doing, so they can be clustered into groups.
>
> Here is what the reduce() looks like:
>
>
>
> public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {
>     Text key = new Text("1");
>
>     Set<String> inputSet = new HashSet<String>();
>     StringBuilder clusterBuilder = new StringBuilder();
>     Set<Set<String>> clClustering = null;
>     Text group = new Text();
>
>     // Complete-Link Clusterer
>     HierarchicalClusterer<String> clClusterer =
>             new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);
>     String[] brandsList = null;
>
>     public void reduce(Text productID, Iterable<Text> brandNames, Context context)
>             throws IOException, InterruptedException {
>         for (Text brand : brandNames) {
>             inputSet.add(brand.toString());
>         }
>         // perform clustering on the inputSet
>         clClustering = clClusterer.cluster(inputSet);
>
>         Iterator<Set<String>> itr = clClustering.iterator();
>         while (itr.hasNext()) {
>             Set<String> brandsSet = itr.next();
>             clusterBuilder.append("[");
>             for (String aBrand : brandsSet) {
>                 clusterBuilder.append(aBrand + ",");
>             }
>             clusterBuilder.append("]");
>         }
>         group.set(clusterBuilder.toString());
>         clusterBuilder = new StringBuilder();
>         context.write(key, group);
>     }
> }
>
>
>
> Thanks,
> -Ahmed
>



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com