Posted to dev@nutch.apache.org by Charlie Williams <cw...@gmail.com> on 2007/02/06 14:42:33 UTC
JobConf Questions
I am very new to the Nutch source code and have been reading over the
Injector class. From what I understood of the MapReduce framework, there
has to be both a map and a reduce step for the algorithm to function
properly. However, in CrawlDb.createJob( Configuration, Path ) a new job is
created for merging the injected URLs that has no mapper class set.
..
JobConf job = new NutchJob(config);
job.setJobName("crawldb " + crawlDb);

Path current = new Path(crawlDb, CrawlDatum.DB_DIR_NAME);
if (FileSystem.get(job).exists(current)) {
  job.addInputPath(current);
}

job.setInputFormat(SequenceFileInputFormat.class);
job.setInputKeyClass(UTF8.class);
job.setInputValueClass(CrawlDatum.class);

job.setReducerClass(CrawlDbReducer.class);

job.setOutputPath(newCrawlDb);
job.setOutputFormat(MapFileOutputFormat.class);
job.setOutputKeyClass(UTF8.class);
job.setOutputValueClass(CrawlDatum.class);

return job;
How does this code function properly?
Is it designed to run only on a single machine, and thus does not need a
mapper class set?
Thanks for any help,
-Charles Williams
Re: JobConf Questions
Posted by Charlie Williams <cw...@gmail.com>.
thanks for the clarification!
-Charlie Williams
On 2/6/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
>
> If no mapper or reducer class is set in the JobConf, the job defaults
> to IdentityMapper and IdentityReducer respectively, which are
> essentially pass-throughs of key/value pairs.
>
> Dennis Kubes
>
Re: JobConf Questions
Posted by Dennis Kubes <nu...@dragonflymc.com>.
If no mapper or reducer class is set in the JobConf, the job defaults
to IdentityMapper and IdentityReducer respectively, which are
essentially pass-throughs of key/value pairs.
Dennis Kubes
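In other words, IdentityMapper simply collects each key/value pair exactly as it received it. Here is a minimal plain-Java sketch of that pass-through behavior; the class and method names are illustrative only, not Hadoop's real API:

```java
import java.util.AbstractMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of identity-mapper semantics: every key/value pair
// is emitted unchanged, so the reducer sees the raw input pairs.
public class IdentityDemo {

    // Analogue of an identity map step: return the pair as received.
    static <K, V> Map.Entry<K, V> identityMap(K key, V value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = List.of(
                new AbstractMap.SimpleEntry<>("http://example.com/", 1),
                new AbstractMap.SimpleEntry<>("http://example.org/", 2));

        List<Map.Entry<String, Integer>> output = input.stream()
                .map(e -> identityMap(e.getKey(), e.getValue()))
                .collect(Collectors.toList());

        // The "mapped" output is identical to the input.
        System.out.println(output.equals(input));
    }
}
```

So in CrawlDb.createJob the real work happens in CrawlDbReducer, which merges all CrawlDatum values collected under the same URL key; the map phase just passes the pairs through to the shuffle.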