Posted to user@nutch.apache.org by brad <br...@bcs-mail.net> on 2010/08/03 02:02:30 UTC

Does org.apache.hadoop.mapred.ReduceTask.run have more than one thread?

Hi,
I continue to have performance problems with the fetch job after the fetching
itself is complete.  When I check with jstack, I see only 1 thread for
org.apache.hadoop.mapred.ReduceTask.run.  Does it have only 1 thread when
Nutch runs on a single machine?  Is there a way to use more than one thread
to improve performance on a single machine?

This leads me to a few other questions:

1) Why is the URLFilters.filter process run as part of
mapred.ReduceTask.run?

2) When I repeatedly check jstack during mapred.ReduceTask.run, it
appears that either URLFilters.filter or BasicURLNormalizer is running.  Is
there a way I can change my configuration to improve the performance of
these functions?

3) Could these functions be run prior to fetching the URL, to eliminate
them completely from the mapred.ReduceTask.run process and gain the advantage
of the multiple threads used in the fetch process?

4) Lastly, in trying to look at the bottlenecks I'm experiencing, I looked
at RegexURLFilter.java.  I was curious why a new Matcher is created on
every call to match instead of using Matcher.reset?  In terms of
performance, my understanding is that using reset is preferable to creating
a new matcher.  Below is an example of what I mean.  Just curious.

private class Rule extends RegexRule {

    private Pattern pattern;
    private Matcher myMatcher;  // add a reusable matcher

    Rule(boolean sign, String regex) {
      super(sign, regex);
      pattern = Pattern.compile(regex);
      myMatcher = pattern.matcher("");  // initialize it with an empty input
    }

    protected boolean match(String url) {
      // return pattern.matcher(url).find();
      return myMatcher.reset(url).find();  // use reset instead of a new matcher
    }
}
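
To make the comparison in question 4 concrete, here is a small standalone sketch that puts the two approaches side by side. It is not Nutch code: the class MatcherReuseSketch, its method names, and the sample regex and URLs are all made up for illustration. One caveat I am sure of is that java.util.regex.Matcher is not safe for use by multiple concurrent threads, so the reset variant is marked synchronized here on the assumption that a rule could be shared between threads.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone class, not part of Nutch.
public class MatcherReuseSketch {

    private final Pattern pattern;
    private final Matcher reusable;  // one Matcher kept for the life of the rule

    public MatcherReuseSketch(String regex) {
        pattern = Pattern.compile(regex);
        reusable = pattern.matcher("");  // start with an empty input
    }

    // Original style: allocate a fresh Matcher on every call.
    public boolean matchWithNewMatcher(String url) {
        return pattern.matcher(url).find();
    }

    // Reuse style: reset the shared Matcher to the new input.
    // Matcher is not thread-safe, hence the synchronized guard.
    public synchronized boolean matchWithReset(String url) {
        return reusable.reset(url).find();
    }

    public static void main(String[] args) {
        // Sample regex and URLs are illustrative only.
        MatcherReuseSketch rule = new MatcherReuseSketch("\\.(gif|jpg|png)$");
        String[] urls = {
            "http://example.com/logo.gif",
            "http://example.com/index.html"
        };
        for (String url : urls) {
            System.out.println(url
                + " new=" + rule.matchWithNewMatcher(url)
                + " reset=" + rule.matchWithReset(url));
        }
    }
}
```

Both methods should return the same answer for any input.  Whether the reset version is measurably faster for URL filtering is something I have not verified and would need to time.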

Sorry about all the questions.  I find, at least on a 1 machine Nutch
configuration, the fetch part of the fetcher is much faster than the
mapred.ReduceTask.run process, and the mapred.ReduceTask process is really
bogging down.

Thank you for your time!
Brad


RE: Does org.apache.hadoop.mapred.ReduceTask.run have more than one thread?

Posted by brad <br...@bcs-mail.net>.
I got the final numbers on fetching 1 million records:

Total Time          29:01:39
Fetch & Parse Time   6:45:32 
MapReduce Time      22:16:07

So, about 77% of a Nutch fetch is spent in the MapReduce portion and only
about 23% of the time is spent in the Fetch and Parse portion.  Is this
typical?  Would the result be similar on a cluster of machines vs a single
machine?
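
For anyone who wants to check the split, here is a minimal sketch of the arithmetic using the three timings above.  The class FetchTimeShare and its helper methods are hypothetical names for this example only.

```java
// Hypothetical helper, only for checking the percentages from the timings above.
public class FetchTimeShare {

    // Convert an "h:mm:ss" string into a number of seconds.
    public static long toSeconds(String hms) {
        String[] parts = hms.trim().split(":");
        return Long.parseLong(parts[0]) * 3600
             + Long.parseLong(parts[1]) * 60
             + Long.parseLong(parts[2]);
    }

    // Fraction of the total time taken by one phase.
    public static double share(String part, String total) {
        return (double) toSeconds(part) / toSeconds(total);
    }

    public static void main(String[] args) {
        String total = "29:01:39";
        System.out.printf("Fetch & Parse share: %.1f%%%n", 100 * share("6:45:32", total));
        System.out.printf("MapReduce share:     %.1f%%%n", 100 * share("22:16:07", total));
    }
}
```

The two phases add up to exactly the total (24332 s + 80167 s = 104499 s), which works out to roughly 23% for Fetch and Parse and 77% for MapReduce.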

What can I do to reduce the MapReduce time?

Thanks
Brad
