Posted to dev@mahout.apache.org by phonechen <ph...@gmail.com> on 2008/05/06 03:41:30 UTC

About the Bayes TrainerDriver

hi,all:
I'm using the Mahout Bayes classifier these days, and I have some questions
about it:
1. SequenceFileModelReader assumes that the number of reduce tasks is 1, and
if the user's hadoop-site.xml sets the default number of reduce tasks to
something larger than 1, the job may produce more than one reduce output
(part-00000, part-00001, ...), and this can lead to problems.
I think it would be more appropriate to change the path parameter to the
directory that contains the reduce files rather than a single reduce file
(see the sketch below):
 *public Model loadModel(FileSystem fs, Path path, Configuration conf)
throws IOException*

or set the number of reduce tasks to 1 in TrainerDriver#runJob():
*conf.setNumReduceTasks(1);*
**
2. Why does the ClassifierDriver class load model data from the local
filesystem instead of HDFS? Loading from HDFS would avoid the copyToLocal
command.
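
To make the first point concrete, here is a rough, untested sketch of a
directory-based loadModel; the Model construction and the processFile helper
are only placeholders for whatever SequenceFileModelReader currently does
with a single part file:

  // sketch only: read every reducer output under a directory instead of one file
  public Model loadModel(FileSystem fs, Path dir, Configuration conf)
      throws IOException {
    Model model = new Model(); // assuming a no-arg constructor; adapt to the real API
    for (FileStatus status : fs.listStatus(dir)) {
      Path p = status.getPath();
      if (!status.isDir() && p.getName().startsWith("part-")) {
        // hypothetical helper: apply the existing single-file loading logic to p
        processFile(fs, p, conf, model);
      }
    }
    return model;
  }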

Looking forward to feedback.
-- 

Best Regards,

Yours
Phonechen


Re: About the Bayes TrainerDriver

Posted by Robin Anil <ro...@gmail.com>.
Good evening. It's monsoon madness here. The whole place is flooded. I hate
this place.

I had posted this on IRC #hadoop before I went to sleep, and got this reply:

<riz0d> But when i try and increase it to multiple reducers ... I get
multiple files part-XXXX in the output folder
<riz0d> What is the best pattern of reading them 1. Loop through the
directory and read one by one
<riz0d> or Run another Map Reduce to do my manipulations on the data
<riz0d> or 3. Use some reading method in hadoop(i am new here) to read the
entire folder outputs
<riz0d> The parts are saved in SequenceFileModel format
<Toad> if you're running another mapreduce to build your model, you can
point it at the directory and it will usually read all the part-* fine
<Toad> if not, might as well loop through and do them one by one... though I
think SequenceFileInputFormat can just take a glob path
<Toad> like /foo/bar/part-*


I tried passing part-* into the SequenceFile.Reader, but it threw an
exception. After searching for how to handle glob paths, I came across this:
  FileStatus[] globStatus(Path pathPattern)
http://hadoop.apache.org/core/docs/r0.16.4/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)

FileStatus.getPath() gives you the Path object, and the pattern can be part-*.

Let me test it out. I will see what comes out
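
Something along these lines is what I have in mind (an untested sketch;
outputPath is the job's output directory, and the key/value types plus the
merge step are placeholders, since they depend on what the trainer actually
writes):

  FileStatus[] parts = fs.globStatus(new Path(outputPath, "part-*"));
  for (FileStatus part : parts) {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
    try {
      Text key = new Text();                       // placeholder key type
      DoubleWritable value = new DoubleWritable(); // placeholder value type
      while (reader.next(key, value)) {
        // merge (key, value) into the in-memory model here
      }
    } finally {
      reader.close();
    }
  }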


Robin

On Mon, Jun 30, 2008 at 6:08 PM, deneche abdelhakim <a_...@yahoo.fr>
wrote:

> In my case I used a SequenceFileOutputFormat, then I used the following code
> to merge, sort, and read the output:
>
> ...
> import org.apache.hadoop.io.SequenceFile.Reader;
> import org.apache.hadoop.io.SequenceFile.Sorter;
> ...
>
>   // list all files in the output path (each reducer should generate
>   // a different file)
>   public static Path[] listOutputFiles(FileSystem fs, Path outpath)
>       throws IOException {
>     FileStatus[] status = fs.listStatus(outpath);
>     List<Path> outpaths = new ArrayList<Path>();
>     for (FileStatus s : status) {
>       if (!s.isDir()) {
>         outpaths.add(s.getPath());
>       }
>     }
>
>     Path[] outfiles = new Path[outpaths.size()];
>     outpaths.toArray(outfiles);
>
>     return outfiles;
>   }
>
>   // merge, sort and read all the output files.
>   // the output keys are used to sort the output
>   // in my case the output key is a LongWritable, and the output value
>   // is a FloatWritable (which is converted to a Float); it should not
>   // be too difficult to use other types
>   // the List evaluations will contain all the output values ordered by
>   // their respective keys
>   public static void importEvaluations(FileSystem fs, JobConf conf,
>       Path outpath, List<Float> evaluations) throws IOException {
>     Sorter sorter = new Sorter(fs, LongWritable.class, FloatWritable.class,
>         conf);
>
>     // merge and sort the outputs
>     Path[] outfiles = listOutputFiles(fs, outpath);
>     Path output = new Path(outpath, "output.sorted");
>     sorter.merge(outfiles, output);
>
>     // import the evaluations
>     LongWritable key = new LongWritable();
>     FloatWritable value = new FloatWritable();
>     Reader reader = new Reader(fs, output, conf);
>
>     while (reader.next(key, value)) {
>       evaluations.add(value.get());
>     }
>
>     reader.close();
>   }
>
> Hope this helps :P
>
> --- On Mon, 30 Jun 2008, Grant Ingersoll <gs...@apache.org>
> wrote:
> From: Grant Ingersoll <gs...@apache.org>
> Subject: Re: About the Bayes TrainerDriver
> To: mahout-dev@lucene.apache.org
> Date: Monday, June 30, 2008, 1:52 PM
>
> I imagine there is a way to combine them or I am just doing something
> stupid.  Won't be the last time for that :-)
>
> -Grant
>
> On Jun 29, 2008, at 11:57 PM, Robin Anil wrote:
>
> > Hi,
> >   I am also stuck in the same place. The Trainer with multiple
> > reducers generates as many part files as there are reducers. I have
> > many questions in my mind.
> > In what way are these part files generated?
> > Is there a definite pattern?
> > In order to read them, should I run another MR job or read through
> > the files one by one?
> >
> > Any thoughts
> >
> > Robin
>
>
>
>

Re: About the Bayes TrainerDriver

Posted by deneche abdelhakim <a_...@yahoo.fr>.
In my case I used a SequenceFileOutputFormat, then I used the following code to merge, sort, and read the output:

...
import org.apache.hadoop.io.SequenceFile.Reader;
import org.apache.hadoop.io.SequenceFile.Sorter;
...

  // list all files in the output path (each reducer should generate a different file)
  public static Path[] listOutputFiles(FileSystem fs, Path outpath)
      throws IOException {
    FileStatus[] status = fs.listStatus(outpath);
    List<Path> outpaths = new ArrayList<Path>();
    for (FileStatus s : status) {
      if (!s.isDir()) {
        outpaths.add(s.getPath());
      }
    }

    Path[] outfiles = new Path[outpaths.size()];
    outpaths.toArray(outfiles);

    return outfiles;
  }

  // merge, sort and read all the output files.
  // the output keys are used to sort the output
  // in my case the output key is a LongWritable, and the output value is a FloatWritable
  // (which is converted to a Float); it should not be too difficult to use other types
  // the List evaluations will contain all the output values ordered by their respective keys
  public static void importEvaluations(FileSystem fs, JobConf conf,
      Path outpath, List<Float> evaluations) throws IOException {
    Sorter sorter = new Sorter(fs, LongWritable.class, FloatWritable.class,
        conf);

    // merge and sort the outputs
    Path[] outfiles = listOutputFiles(fs, outpath);
    Path output = new Path(outpath, "output.sorted");
    sorter.merge(outfiles, output);

    // import the evaluations
    LongWritable key = new LongWritable();
    FloatWritable value = new FloatWritable();
    Reader reader = new Reader(fs, output, conf);

    while (reader.next(key, value)) {
      evaluations.add(value.get());
    }

    reader.close();
  }
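
A hypothetical call site (the output path name here is just for illustration) might look like:

    JobConf conf = new JobConf();
    FileSystem fs = FileSystem.get(conf);
    List<Float> evaluations = new ArrayList<Float>();
    importEvaluations(fs, conf, new Path("evaluations-output"), evaluations);
    // evaluations now holds all output values, ordered by their keys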

Hope this helps :P

--- On Mon, 30 Jun 2008, Grant Ingersoll <gs...@apache.org> wrote:
From: Grant Ingersoll <gs...@apache.org>
Subject: Re: About the Bayes TrainerDriver
To: mahout-dev@lucene.apache.org
Date: Monday, June 30, 2008, 1:52 PM

I imagine there is a way to combine them or I am just doing something  
stupid.  Won't be the last time for that :-)

-Grant

On Jun 29, 2008, at 11:57 PM, Robin Anil wrote:

> Hi,
>   I am also stuck in the same place. The Trainer with multiple
> reducers generates as many part files as there are reducers. I have
> many questions in my mind.
> In what way are these part files generated?
> Is there a definite pattern?
> In order to read them, should I run another MR job or read through
> the files one by one?
>
> Any thoughts
>
> Robin



Re: About the Bayes TrainerDriver

Posted by Grant Ingersoll <gs...@apache.org>.
I imagine there is a way to combine them or I am just doing something  
stupid.  Won't be the last time for that :-)

-Grant

On Jun 29, 2008, at 11:57 PM, Robin Anil wrote:

> Hi,
>   I am also stuck in the same place. The Trainer with multiple
> reducers generates as many part files as there are reducers. I have
> many questions in my mind.
> In what way are these part files generated?
> Is there a definite pattern?
> In order to read them, should I run another MR job or read through
> the files one by one?
>
> Any thoughts
>
> Robin



Re: About the Bayes TrainerDriver

Posted by Robin Anil <ro...@gmail.com>.
Hi,
   I am also stuck in the same place. The Trainer with multiple reducers
generates as many part files as there are reducers. I have many
questions in my mind.
 In what way are these part files generated?
 Is there a definite pattern?
 In order to read them, should I run another MR job or read through the files
one by one?

 Any thoughts

Robin

Re: About the Bayes TrainerDriver

Posted by Grant Ingersoll <gs...@apache.org>.
Thanks Andrzej!  That makes sense.  I hope to look at making the
Classifier M/R-ready in about 1.5 weeks (relatives in town), but if
someone else wants to tackle it sooner, by all means, jump in.

-Grant

On May 6, 2008, at 4:16 PM, Andrzej Bialecki wrote:

> Grant Ingersoll wrote:
>> On May 6, 2008, at 8:04 AM, phonechen wrote:
>>> sorry, I made a mistake;
>>> what I meant is: shall we put the doc to be classified on HDFS,
>>> leave the Model files on HDFS,
>>> and make the whole classification process run on HDFS?
>>> so what would change is:
>>> =====================
>>>  Configuration conf = new JobConf();
>>> FileSystem raw = new RawLocalFileSystem();
>>> raw.setConf(conf);
>>> FileSystem fs = new LocalFileSystem(raw);
>>> ==================
>>> to
>>> ========================
>>>  Configuration conf = new JobConf();
>>>  FileSystem fs = new DistributedFileSystem();
>>> =======================
>
> Speaking as a Hadoop developer ... You should do neither, i.e. you  
> should not instantiate explicitly any FileSystem implementations.  
> There are many reasons for this (object pooling, cleanup, caching,  
> etc).
>
> The canonical idiom for this is the following:
>
> 	FileSystem fs = FileSystem.get(conf);
>
> This way you get either a local FS, or HDFS, or Amazon S3, or KFS,  
> or whatever else is configured as the default filesystem. The  
> benefit is obvious - you don't have to change the code if your  
> configuration changes, i.e. you can transparently move your  
> application from local FS to DFS or S3. Some FS implementations may  
> use pooling, which happens behind the scenes if you use the above.
>
> If you really, really need a local fs, you should use the following  
> idiom:
>
> 	LocalFileSystem localFS = FileSystem.getLocal(conf);
>
> Depending on the configuration (and Hadoop version) you could get  
> different subclasses of a local FS.
>
> Now, what to do if you use something (e.g. HDFS) by default, but you  
> want to make sure that you retrieve some resource that resides on  
> specific other FS? You should use a fully qualified URI when  
> constructing a Path, i.e. a URI that also contains a schema.
>
> Example:
>
> 	Path localPath = new Path("file:///etc/hosts");
> 	Path hdfsPath = new Path("hdfs://namenode:9000/user/data/file");
>
> localPath will use a LocalFileSystem, no matter what FS is the  
> default, and hdfsPath will use DistributedFileSystem that can be  
> reached at the host "namenode" and port 9000, no matter what is the  
> current FS configuration.
>
> And finally - to learn what is the current FileSystem that a Path  
> refers to, do the following:
>
> 	Path unqualified = new Path("/etc/hosts");
> 	FileSystem fs = unqualified.getFileSystem(conf);
>
> You can also make a fully qualified path from a path that is missing  
> an explicit scheme, and may be relative to your current working
> directory:
>
> 	Path unqualified = new Path("test");
> 	Path qualified = unqualified.makeQualified(fs);
>
> If your cwd=/home/nutch and your file system is local, then  
> qualified.toString() would give "file:///home/nutch/test".
>
> Hope this helps ...
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







Re: About the Bayes TrainerDriver

Posted by Andrzej Bialecki <ab...@getopt.org>.
Grant Ingersoll wrote:
> 
> On May 6, 2008, at 8:04 AM, phonechen wrote:
> 
>> sorry, I made a mistake;
>> what I meant is: shall we put the doc to be classified on HDFS,
>> leave the Model files on HDFS,
>> and make the whole classification process run on HDFS?
>> so what would change is:
>> =====================
>>   Configuration conf = new JobConf();
>> FileSystem raw = new RawLocalFileSystem();
>>  raw.setConf(conf);
>>  FileSystem fs = new LocalFileSystem(raw);
>> ==================
>> to
>> ========================
>>   Configuration conf = new JobConf();
>>   FileSystem fs = new DistributedFileSystem();
>> =======================

Speaking as a Hadoop developer ... You should do neither, i.e. you 
should not instantiate explicitly any FileSystem implementations. There 
are many reasons for this (object pooling, cleanup, caching, etc).

The canonical idiom for this is the following:

	FileSystem fs = FileSystem.get(conf);

This way you get either a local FS, or HDFS, or Amazon S3, or KFS, or 
whatever else is configured as the default filesystem. The benefit is 
obvious - you don't have to change the code if your configuration 
changes, i.e. you can transparently move your application from local FS 
to DFS or S3. Some FS implementations may use pooling, which happens 
behind the scenes if you use the above.

If you really, really need a local fs, you should use the following idiom:

	LocalFileSystem localFS = FileSystem.getLocal(conf);

Depending on the configuration (and Hadoop version) you could get 
different subclasses of a local FS.

Now, what to do if you use something (e.g. HDFS) by default, but you 
want to make sure that you retrieve some resource that resides on 
specific other FS? You should use a fully qualified URI when 
constructing a Path, i.e. a URI that also contains a scheme.

Example:

	Path localPath = new Path("file:///etc/hosts");
	Path hdfsPath = new Path("hdfs://namenode:9000/user/data/file");

localPath will use a LocalFileSystem, no matter what FS is the default, 
and hdfsPath will use DistributedFileSystem that can be reached at the 
host "namenode" and port 9000, no matter what is the current FS 
configuration.

And finally - to learn what is the current FileSystem that a Path refers 
to, do the following:

	Path unqualified = new Path("/etc/hosts");
	FileSystem fs = unqualified.getFileSystem(conf);

You can also make a fully qualified path from a path that is missing 
an explicit scheme, and may be relative to your current working directory:

	Path unqualified = new Path("test");
	Path qualified = unqualified.makeQualified(fs);

If your cwd=/home/nutch and your file system is local, then 
qualified.toString() would give "file:///home/nutch/test".
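
Applied to the model-loading snippet earlier in this thread, that boils down
to something like this (just a sketch, reusing the existing loadModel
signature):

	Configuration conf = new JobConf();
	Path path = new Path(cmdLine.getOptionValue(pathOpt.getOpt()));
	// local FS, HDFS, S3, ... is chosen by the path's scheme and the configuration
	FileSystem fs = path.getFileSystem(conf);
	Model model = reader.loadModel(fs, path, conf);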

Hope this helps ...

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: About the Bayes TrainerDriver

Posted by Grant Ingersoll <gs...@apache.org>.
On May 6, 2008, at 8:04 AM, phonechen wrote:

> sorry, I made a mistake;
> what I meant is: shall we put the doc to be classified on HDFS,
> leave the Model files on HDFS,
> and make the whole classification process run on HDFS?
> so what would change is:
> =====================
>   Configuration conf = new JobConf();
> FileSystem raw = new RawLocalFileSystem();
>  raw.setConf(conf);
>  FileSystem fs = new LocalFileSystem(raw);
> ==================
> to
> ========================
>   Configuration conf = new JobConf();
>   FileSystem fs = new DistributedFileSystem();
> =======================
>
> so we can classify a batch of inputs using MapReduce instead of running
> multiple ClassifierDriver processes.
> Correct me if anything is wrong.
> PS: Can we make the classification process parallel?
>
>

Sure.  The Driver class is just an easy way to access the Classifier.   
You should be able to call the Classifier as needed as well, and I  
suppose we could add Mapper and Reducer for the classification task as  
well.  I'll look into it the next chance I get, otherwise, feel free  
to update my patch w/ your suggestions.
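
For the record, a very rough skeleton of such a Mapper might look like the
following (imports from org.apache.hadoop.mapred and org.apache.hadoop.io
omitted); the classify(...) call is only a placeholder, since the real
Classifier API may differ:

  public class BayesClassifierMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text document,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // placeholder: run the already-loaded model over the document text
      String label = classify(document.toString());
      output.collect(new Text(label), document);
    }

    private String classify(String doc) {
      // hypothetical hook into the Bayes classifier; not the real API
      return "unlabeled";
    }
  }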

Cheers,
Grant

Re: About the Bayes TrainerDriver

Posted by phonechen <ph...@gmail.com>.
sorry, I made a mistake;
what I meant is: shall we put the doc to be classified on HDFS,
leave the Model files on HDFS,
and make the whole classification process run on HDFS?
so what would change is:
=====================
   Configuration conf = new JobConf();
 FileSystem raw = new RawLocalFileSystem();
  raw.setConf(conf);
  FileSystem fs = new LocalFileSystem(raw);
==================
to
========================
   Configuration conf = new JobConf();
   FileSystem fs = new DistributedFileSystem();
=======================

so we can classify a batch of inputs using MapReduce instead of running multiple
ClassifierDriver processes.
Correct me if anything is wrong.
PS: Can we make the classification process parallel?





On 5/6/08, Grant Ingersoll <gs...@apache.org> wrote:
>
>
> On May 5, 2008, at 9:41 PM, phonechen wrote:
>
> hi,all:
> > I'm using the Mahout Bayes classifier these days, and I have some
> > questions about it:
> > 1. SequenceFileModelReader assumes that the number of reduce tasks is
> > 1, and if the user's hadoop-site.xml sets the default number of reduce
> > tasks to something larger than 1, the job may produce more than one
> > reduce output (part-00000, part-00001, ...), and this can lead to
> > problems.
> > I think it would be more appropriate to change the path parameter to
> > the directory that contains the reduce files rather than a single
> > reduce file:
> > *public Model loadModel(FileSystem fs, Path path, Configuration conf)
> > throws IOException*
> >
> > or set the number of reduce tasks to 1 in TrainerDriver#runJob():
> > *conf.setNumReduceTasks(1);*
> >
>
> Good catch.  We definitely want more than one reduce task.  I will fix
> that and put in tests for multiple reduces.
>
>
> > **
> > 2. Why does the ClassifierDriver class load model data from the local
> > filesystem instead of HDFS? Loading from HDFS would avoid the
> > copyToLocal command.
> >
>
> but don't you just have to do a copyToLocal to get it on the local
> filesystem via bin/hadoop dfs -copyToLocal?  I must admit, it has been a
> while since I have done Hadoop stuff, especially the administrative stuff
> (1+ year).  I guess I was thinking you could load in non-distributed mode
> using the LocalFileSystem.  The code I have is:
>
>      Configuration conf = new JobConf();
>      FileSystem raw = new RawLocalFileSystem();
>      raw.setConf(conf);
>      FileSystem fs = new LocalFileSystem(raw);
>      fs.setConf(conf);
>
>      Path path = new Path(cmdLine.getOptionValue(pathOpt.getOpt()));
>      System.out.println("Loading model from: " + path);
>      Model model = reader.loadModel(fs, path, conf);
>
> Thanks for the feedback,
> Grant
>



-- 

Best Regards,

Yours
Phonechen


Re: About the Bayes TrainerDriver

Posted by Grant Ingersoll <gs...@apache.org>.
On May 5, 2008, at 9:41 PM, phonechen wrote:

> hi,all:
> I'm using the Mahout Bayes classifier these days, and I have some
> questions about it:
> 1. SequenceFileModelReader assumes that the number of reduce tasks is
> 1, and if the user's hadoop-site.xml sets the default number of reduce
> tasks to something larger than 1, the job may produce more than one
> reduce output (part-00000, part-00001, ...), and this can lead to
> problems.
> I think it would be more appropriate to change the path parameter to
> the directory that contains the reduce files rather than a single
> reduce file:
> *public Model loadModel(FileSystem fs, Path path, Configuration conf)
> throws IOException*
>
> or set the number of reduce tasks to 1 in TrainerDriver#runJob():
> *conf.setNumReduceTasks(1);*

Good catch.  We definitely want more than one reduce task.  I will fix  
that and put in tests for multiple reduces.

>
> **
> 2. Why does the ClassifierDriver class load model data from the local
> filesystem instead of HDFS? Loading from HDFS would avoid the
> copyToLocal command.

but don't you just have to do a copyToLocal to get it on the local  
filesystem via bin/hadoop dfs -copyToLocal?  I must admit, it has been  
a while since I have done Hadoop stuff, especially the administrative  
stuff (1+ year).  I guess I was thinking you could load in non- 
distributed mode using the LocalFileSystem.  The code I have is:

       Configuration conf = new JobConf();
       FileSystem raw = new RawLocalFileSystem();
       raw.setConf(conf);
       FileSystem fs = new LocalFileSystem(raw);
       fs.setConf(conf);

       Path path = new Path(cmdLine.getOptionValue(pathOpt.getOpt()));
       System.out.println("Loading model from: " + path);
       Model model = reader.loadModel(fs, path, conf);

Thanks for the feedback,
Grant