Posted to dev@mahout.apache.org by phonechen <ph...@gmail.com> on 2008/05/26 11:44:24 UTC

About Bayesian Classification on MapReduce

Hi all,
Here are some slides from the Google Cluster Computing Faculty Training
Workshop (March 2008):
http://net.pku.edu.cn/~course/cs402/resource/Google%20Cluster%20Computing%20Faculty%20Training%20Workshop-March.2008/

In this one:
http://net.pku.edu.cn/~course/cs402/resource/Google%20Cluster%20Computing%20Faculty%20Training%20Workshop-March.2008/Module%204%20-%20MapReduce%20Theory%20and%20Algorithms.ppt

Page 45 introduces Bayesian classification on MapReduce:
Files containing classification instances are sent to mappers
Map (filename, instance) -> (instance, class)
Identity Reducer
Existing toolsets exist to perform Bayes classification on instance
E.g., WEKA, already in Java!
Another example of discarding input key
Can somebody explain what the "instance" is?
I think it is the content of the file to be classified (after
tokenizing), right?
But why does the output of map() use it as a key?
Any ideas?
Thanks!


-- 
--~--~---------~--~----~------------~-------~--

Best Regards,

Yours
Phonechen

-~----------~----~----~----~------~----~------

Re: About Bayesian Classification on MapReduce

Posted by Ted Dunning <te...@gmail.com>.

Just right.

You can set the number of reducers to 0.  That makes it so that the outputs
from the maps are written directly to HDFS without shuffling, sorting or
reducing.
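
A minimal sketch of that map-only flow for the <url,ParseText> case quoted
below, written as a plain-Python simulation rather than real Hadoop or Nutch
code. The classify() function and the example URLs are hypothetical
placeholders for a real classifier (for example, one built with WEKA) and for
real crawl data.

```python
# Plain-Python simulation of a map-only job (zero reducers); not Hadoop code.

def classify(text):
    # Hypothetical stand-in for a real classifier such as one built with WEKA.
    return "sports" if "match" in text.lower() else "other"

def map_fn(url, parse_text):
    # Keep the input key (the URL) and emit the predicted class as the value.
    yield url, classify(parse_text)

def run_map_only(splits):
    # Each input split is handled by one map task, and each task's output
    # is written out as-is: no shuffle, no sort, no reduce.
    outputs = []
    for split in splits:
        part = []
        for url, text in split:
            part.extend(map_fn(url, text))
        outputs.append(part)
    return outputs

# Two input splits, so two map tasks and two output parts.
splits = [
    [("http://example.org/b", "Match report from the final"),
     ("http://example.org/a", "Quarterly earnings summary")],
    [("http://example.org/c", "Preview of tonight's match")],
]
parts = run_map_only(splits)
for i, part in enumerate(parts):
    print("part", i, part)
```

Note that the records stay in input order within each part, because nothing
sorts them when there is no reduce phase.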

On Mon, May 26, 2008 at 6:39 PM, phonechen <ph...@gmail.com> wrote:

> Ted:
> Thank you for your answer; I've got the idea.
> In my case, I use Nutch as a crawler, so I have <url,ParseText> pairs,
> stored in the segments dir, as input.
> Here the ParseText is the instance. I then use WEKA to classify the
> ParseText, and it returns a class,
> so the map() output is <url,class> and the reduce() is identity (I am not sure
> whether this word means "I would have written the code to not have a reducer, in
> which case the output of the map would be the output.")
> Right?
>
>

Re: About Bayesian Classification on MapReduce

Posted by phonechen <ph...@gmail.com>.

Ted:
Thank you for your answer; I've got the idea.
In my case, I use Nutch as a crawler, so I have <url,ParseText> pairs,
stored in the segments dir, as input.
Here the ParseText is the instance. I then use WEKA to classify the
ParseText, and it returns a class,
so the map() output is <url,class> and the reduce() is identity (I am not sure
whether this word means "I would have written the code to not have a reducer, in
which case the output of the map would be the output.")
Right?
Thanks again!





On 5/27/08, Ted Dunning <te...@gmail.com> wrote:
>
> I don't understand some of your questions.  I think I could answer them if
> I did.
>
> On Mon, May 26, 2008 at 2:44 AM, phonechen <ph...@gmail.com> wrote:
>
> > Here are some slides from Google Cluster Computing Faculty Training ...
> > <http://net.pku.edu.cn/%7Ecourse/cs402/resource/Google%20Cluster%20Computing%20Faculty%20Training%20Workshop-March.2008/Module%204%20-%20MapReduce%20Theory%20and%20Algorithms.ppt>
> > On page 45, it introduces Bayesian classification on MapReduce. Files containing
> > classification instances are sent to mappers
>
>
> OK.
>
>
> > Can somebody explain what the "instance" is?
> >
>
> An instance here is a thing to be classified.  The filename is ignored.
>
>
> > Existing toolsets exist to perform Bayes classification on instance
> > E.g., WEKA, already in Java!
>
>
> Yes.  These do exist.  They may or may not be helpful.  Naive Bayesian
> classification, for example, is so simple that it doesn't make much sense to
> port existing code as opposed to just writing the one or two lines of matrix
> arithmetic needed.
>
> > But why does the output of map() use it as a key?
>
>
> I would have written the code to not have a reducer, in which case the
> output of the map would be the output.  In that case, you still need to
> supply some sort of key and some sort of value.  The easiest way to do this is
> to put the instance as key and class as value, but you could as well do it the
> other way around.  You could also put an arbitrary value as key and write a
> reducer that throws it away.  That would let you control the number of
> output files, but would also cost you some time in the shuffle and sort.
>
> Hope this helps.
>



-- 
--~--~---------~--~----~------------~-------~--

Best Regards,

Yours
Phonechen

-~----------~----~----~----~------~----~------

Re: About Bayesian Classification on MapReduce

Posted by Ted Dunning <te...@gmail.com>.

I don't understand some of your questions.  I think I could answer them if I
did.

On Mon, May 26, 2008 at 2:44 AM, phonechen <ph...@gmail.com> wrote:

> Here are some slides from Google Cluster Computing Faculty Training ...
> <http://net.pku.edu.cn/%7Ecourse/cs402/resource/Google%20Cluster%20Computing%20Faculty%20Training%20Workshop-March.2008/Module%204%20-%20MapReduce%20Theory%20and%20Algorithms.ppt>
> On page 45, it introduces Bayesian classification on MapReduce. Files containing
> classification instances are sent to mappers

OK.

> Can somebody explain what the "instance" is?
>

An instance here is a thing to be classified.  The filename is ignored.

> Existing toolsets exist to perform Bayes classification on instance
> E.g., WEKA, already in Java!

Yes.  These do exist.  They may or may not be helpful.  Naive Bayesian
classification, for example, is so simple that it doesn't make much sense to
port existing code as opposed to just writing the one or two lines of matrix
arithmetic needed.
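
To make the "matrix arithmetic" remark concrete, here is a small sketch. It
assumes a multinomial naive Bayes model with Laplace smoothing over word
counts; the thread does not say which variant is meant, and the vocabulary,
class names and training data are made up for illustration. Once the model is
trained, classifying an instance is one product of the term-count vector with
a matrix of log-likelihoods, plus the log-priors, followed by an argmax.

```python
import math

def train(docs, labels, vocab, classes, alpha=1.0):
    # counts[c][j] = occurrences of vocab[j] in training documents of class c
    counts = [[0] * len(vocab) for _ in classes]
    n_docs = [0] * len(classes)
    for words, label in zip(docs, labels):
        c = classes.index(label)
        n_docs[c] += 1
        for w in words:
            if w in vocab:
                counts[c][vocab.index(w)] += 1
    log_prior = [math.log(n / len(docs)) for n in n_docs]
    # Laplace (add-alpha) smoothing so unseen words never get probability 0.
    log_like = []
    for row in counts:
        total = sum(row) + alpha * len(vocab)
        log_like.append([math.log((x + alpha) / total) for x in row])
    return log_prior, log_like

def predict(words, vocab, classes, log_prior, log_like):
    x = [words.count(w) for w in vocab]  # term-count vector
    # The "one or two lines": scores = log_prior + log_like . x, then argmax.
    scores = [lp + sum(xi * wi for xi, wi in zip(x, row))
              for lp, row in zip(log_prior, log_like)]
    return classes[scores.index(max(scores))]

# Made-up toy data, purely for illustration.
vocab = ["goal", "match", "stock", "market"]
classes = ["sports", "finance"]
docs = [["goal", "match", "match"], ["stock", "market", "stock"],
        ["match", "goal"], ["market", "stock"]]
labels = ["sports", "finance", "sports", "finance"]

log_prior, log_like = train(docs, labels, vocab, classes)
print(predict(["match", "goal", "market"], vocab, classes, log_prior, log_like))
print(predict(["stock", "market"], vocab, classes, log_prior, log_like))
```

In a MapReduce setting, predict() is what each mapper would call per
instance, with log_prior and log_like loaded once per task.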

> But why does the output of map() use it as a key?

I would have written the code to not have a reducer, in which case the
output of the map would be the output.  In that case, you still need to
supply some sort of key and some sort of value.  The easiest way to do this is to
put the instance as key and class as value, but you could as well do it the
other way around.  You could also put an arbitrary value as key and write a
reducer that throws it away.  That would let you control the number of
output files, but would also cost you some time in the shuffle and sort.
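
The last alternative can be sketched as follows, again as a plain-Python
simulation and not Hadoop code. The arbitrary key here is a checksum bucket,
which is just one possible choice; classify() and the sample instances are
hypothetical. The point is that the number of output files follows the number
of reducers, at the price of a shuffle and sort step that a map-only job
skips.

```python
import zlib

def classify(instance):
    # Hypothetical stand-in classifier.
    return "long" if len(instance) > 10 else "short"

def map_fn(instance, num_buckets):
    # Arbitrary key: its only job is to route the record to some reducer.
    key = zlib.crc32(instance.encode("utf-8")) % num_buckets
    yield key, (instance, classify(instance))

def reduce_fn(key, values):
    # Throw the arbitrary key away and emit (instance, class) pairs.
    for instance, label in values:
        yield instance, label

def run_job(instances, num_reducers):
    # Shuffle: route every map output to one partition per reducer.
    partitions = [dict() for _ in range(num_reducers)]
    for inst in instances:
        for key, value in map_fn(inst, num_reducers):
            partitions[key % num_reducers].setdefault(key, []).append(value)
    # Sort and reduce: each reducer writes exactly one output file.
    files = []
    for part in partitions:
        out = []
        for key in sorted(part):
            out.extend(reduce_fn(key, part[key]))
        files.append(out)
    return files

instances = ["a short one", "tiny", "a much longer instance", "mid-sized"]
files = run_job(instances, num_reducers=2)
print(len(files), "output files")
```

With num_reducers=2 there are always two output files, however many map
tasks produced the records; some files may be empty if no key lands in them.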

Hope this helps.