Posted to common-dev@hadoop.apache.org by phonechen <ph...@gmail.com> on 2008/04/20 05:58:26 UTC

Hadoop Job without Mapper and Reducer class.

Hello all:
While reading the Nutch source code, I found that processDumpJob(String
crawlDb, String output, Configuration config) in CrawlDbReader.java only
sets an input format and an output format, without any Mapper or Reducer
class. Yet it can dump the existing crawldb to a text format.
Can anyone tell me how this works?
Thanks!

Here is the source code:
-----------------------------------------------
public void processDumpJob(String crawlDb, String output, Configuration config)
    throws IOException {

  if (LOG.isInfoEnabled()) {
    LOG.info("CrawlDb dump: starting");
    LOG.info("CrawlDb db: " + crawlDb);
  }

  Path outFolder = new Path(output);

  JobConf job = new NutchJob(config);
  job.setJobName("dump " + crawlDb);

  job.addInputPath(new Path(crawlDb, CrawlDb.CURRENT_NAME));
  job.setInputFormat(SequenceFileInputFormat.class);

  job.setOutputPath(outFolder);
  job.setOutputFormat(TextOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);

  JobClient.runJob(job);
  if (LOG.isInfoEnabled()) { LOG.info("CrawlDb dump: done"); }
}
----------------------------------------------------




-- 

Best Regards,

Yours
Phonechen


Re: Hadoop Job without Mapper and Reducer class.

Posted by Enis Soztutar <en...@gmail.com>.
Hi,

JobConf has default values for these: IdentityMapper and
IdentityReducer. These functors, as their names imply, do not alter the
data but pass it through intact. The dump job does not need to alter the
data, only to transform it from the (binary) SequenceFile input format
to the text output format.
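
For reference, the identity functors behave roughly like the sketch below
(a simplified rendering of the old org.apache.hadoop.mapred API, with
hypothetical class names, not the actual library source):

-----------------------------------------------
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch: a mapper that emits every (key, value) pair unchanged.
class SketchIdentityMapper<K, V> extends MapReduceBase
    implements Mapper<K, V, K, V> {
  public void map(K key, V val, OutputCollector<K, V> output,
                  Reporter reporter) throws IOException {
    output.collect(key, val);
  }
}

// Sketch: a reducer that emits each grouped value unchanged.
class SketchIdentityReducer<K, V> extends MapReduceBase
    implements Reducer<K, V, K, V> {
  public void reduce(K key, Iterator<V> values,
                     OutputCollector<K, V> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      output.collect(key, values.next());
    }
  }
}
-----------------------------------------------

So the records flow through untouched, and TextOutputFormat does the
actual conversion: for each record it writes the key, a tab, and the
value's toString() form on one line, which is what turns the binary
CrawlDatum entries into readable text.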

phonechen wrote:
> Hello all:
> While reading the Nutch source code, I found that processDumpJob(String
> crawlDb, String output, Configuration config) in CrawlDbReader.java only
> sets an input format and an output format, without any Mapper or Reducer
> class. Yet it can dump the existing crawldb to a text format.
> Can anyone tell me how this works?
> Thanks!