Posted to dev@mahout.apache.org by "zhao zhendong (JIRA)" <ji...@apache.org> on 2009/12/20 06:38:18 UTC

[jira] Created: (MAHOUT-227) Parallel SVM

Parallel SVM
------------

                 Key: MAHOUT-227
                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
             Project: Mahout
          Issue Type: Task
          Components: Classification
            Reporter: zhao zhendong


I wrote a proposal for a parallel algorithm for SVM training. Any comments are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (MAHOUT-227) Parallel SVM

Posted by zhao zhendong <zh...@gmail.com>.

Thanks.
On Tue, Dec 22, 2009 at 11:40 AM, Ted Dunning (JIRA) <ji...@apache.org>wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793487#action_12793487]
>
> Ted Dunning commented on MAHOUT-227:
> ------------------------------------
>
> {quote}
> I understand this concern. Actually, if we set the parameter k to 1,000,000
> or higher, do you think it is reasonable to take advantage of the Map-reduce
> framework? I mean, from a system implementation point of view.
> {quote}
>
> If you increase the value of k to very large values, you will be able to
> get a bit more computation, but if you follow my small cluster example I
> think that increasing k from 1000 to 1,000,000 will likely increase
> efficiency from 0.1% to less than 50% and will drive the algorithm well
> beyond the region where kT is constant.  You will still have quite a lot of
> I/O per cycle which may prevent you from achieving even 10% efficiency.
>

> For larger clusters, the problem will be much worse.
>
> Go ahead and try it, though.  Your real results count for more than my
> estimates.
>

Ok, I will try this first.


> And as I said before, getting a good sequential implementation is of real
> value as well.
>

Could you please specify this sequential implementation?


> > Parallel SVM
> > ------------
> >
> >                 Key: MAHOUT-227
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
> >             Project: Mahout
> >          Issue Type: Task
> >          Components: Classification
> >            Reporter: zhao zhendong
> >         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
> >
> >
> > I wrote a proposal of parallel algorithm for SVM training. Any comment is
> welcome.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


-- 
-------------------------------------------------------------

Zhen-Dong Zhao (Maxim)

<><<><><><><><><><>><><><><><>>>>>>

Department of Computer Science
School of Computing
National University of Singapore

><><><><><><><><><><><><><><><><<<<
Homepage:http://zhaozhendong.googlepages.com
Mail: zhaozhendong@gmail.com
>>>>>>><><><><><><><><<><>><><<<<<<

Re: [jira] Commented: (MAHOUT-227) Parallel SVM

Posted by zhao zhendong <zh...@gmail.com>.

Hi,

I see. Thanks for your explanation. I thought that everything in Mahout
should be parallelized.

I agree with Ted: increasing k may not bring any improvement, especially
on a large cluster. *Large-scale learning, however, has at least two
levels: one is the algorithm, and the other is data storage or caching.*
With Hadoop, users can store a large-scale dataset in the cluster, or even
load the dataset into memory, and then run the training process using the
sequential implementation of Pegasos.

Is my understanding correct?

Cheers,
Zhendong

On Tue, Dec 22, 2009 at 1:55 PM, Jake Mannix <ja...@gmail.com> wrote:

> Zhao,
>
>  Mahout is not just for hadoop-based implementations.  We are interested in
> "scalable machine learning" - we currently have *no* SVM implementations in
> Mahout, and would welcome an easy simple straightforward SVM, and would find
> something like the original Pegasos implemented in our APIs also an excellent
> addition.
>
>  If at some point we added a fully parallelized hadoop-based Pegasos, that
> would be great, sure, but we don't require everything contributed to Mahout
> to run on Hadoop. Currently quite a bit of our libraries have nothing parallel
> about them yet, but they are all aimed to be able to scale to large data sets.
>
>  Does this make sense?
>
>  -jake
>
> On Mon, Dec 21, 2009 at 9:21 PM, zhao zhendong <zhaozhendong@gmail.com
> >wrote:
>
> > {quote}
> > k = 1
> > Otherwise as in the Pegasos article.  No parallelism.
> > {quote}
> >
> > I am confused. In that case, what is the motivation for integrating
> > Pegasos into Mahout?
> >
> > Can you estimate in which situations this implementation could outperform
> > the original Pegasos? On large-scale data sets, or for some other reason?
> >
> > With this implementation, how can we take advantage of the Map-reduce
> > framework?
> >
> >
> > On Tue, Dec 22, 2009 at 12:44 PM, Ted Dunning (JIRA) <jira@apache.org
> > >wrote:
> >
> > >
> > >    [
> > >
> >
> https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793497#action_12793497
> > ]
> > >
> > > Ted Dunning commented on MAHOUT-227:
> > > ------------------------------------
> > >
> > > {quote}
> > > Can you specify this sequential implementation?
> > > {quote}
> > >
> > > k = 1
> > >
> > > Otherwise as in the Pegasos article.
> > >
> > >
> > > > Parallel SVM
> > > > ------------
> > > >
> > > >                 Key: MAHOUT-227
> > > >                 URL:
> https://issues.apache.org/jira/browse/MAHOUT-227
> > > >             Project: Mahout
> > > >          Issue Type: Task
> > > >          Components: Classification
> > > >            Reporter: zhao zhendong
> > > >         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
> > > >
> > > >
> > > > I wrote a proposal of parallel algorithm for SVM training. Any
> comment
> > is
> > > welcome.
> > >
> > > --
> > > This message is automatically generated by JIRA.
> > > -
> > > You can reply to this email to add a comment to the issue online.
> > >
> > >
> >
> >
> > --
> > -------------------------------------------------------------
> >
> > Zhen-Dong Zhao (Maxim)
> >
> > <><<><><><><><><><>><><><><><>>>>>>
> >
> > Department of Computer Science
> > School of Computing
> > National University of Singapore
> >
> > ><><><><><><><><><><><><><><><><<<<
> > Homepage:http://zhaozhendong.googlepages.com
> > Mail: zhaozhendong@gmail.com
> > >>>>>>><><><><><><><><<><>><><<<<<<
> >
>




Re: [jira] Commented: (MAHOUT-227) Parallel SVM

Posted by Jake Mannix <ja...@gmail.com>.

Zhao,

  Mahout is not just for hadoop-based implementations.  We are interested in
"scalable machine learning" - we currently have *no* SVM implementations in
Mahout, and would welcome an easy simple straightforward SVM, and would find
something like the original Pegasos implemented in our APIs also an excellent
addition.

  If at some point we added a fully parallelized hadoop-based Pegasos, that
would be great, sure, but we don't require everything contributed to Mahout
to run on Hadoop. Currently quite a bit of our libraries have nothing parallel
about them yet, but they are all aimed to be able to scale to large data sets.

  Does this make sense?

  -jake

On Mon, Dec 21, 2009 at 9:21 PM, zhao zhendong <zh...@gmail.com>wrote:

> {quote}
> k = 1
> Otherwise as in the Pegasos article.  No parallelism.
> {quote}
>
> I am confused. In that case, what is the motivation for integrating
> Pegasos into Mahout?
>
> Can you estimate in which situations this implementation could outperform
> the original Pegasos? On large-scale data sets, or for some other reason?
>
> With this implementation, how can we take advantage of the Map-reduce
> framework?
>
>
> On Tue, Dec 22, 2009 at 12:44 PM, Ted Dunning (JIRA) <jira@apache.org
> >wrote:
>
> >
> >    [
> >
> https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793497#action_12793497
> ]
> >
> > Ted Dunning commented on MAHOUT-227:
> > ------------------------------------
> >
> > {quote}
> > Can you specify this sequential implementation?
> > {quote}
> >
> > k = 1
> >
> > Otherwise as in the Pegasos article.
> >
> >
> > > Parallel SVM
> > > ------------
> > >
> > >                 Key: MAHOUT-227
> > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
> > >             Project: Mahout
> > >          Issue Type: Task
> > >          Components: Classification
> > >            Reporter: zhao zhendong
> > >         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
> > >
> > >
> > > I wrote a proposal of parallel algorithm for SVM training. Any comment
> is
> > welcome.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
>
>
> --
> -------------------------------------------------------------
>
> Zhen-Dong Zhao (Maxim)
>
> <><<><><><><><><><>><><><><><>>>>>>
>
> Department of Computer Science
> School of Computing
> National University of Singapore
>
> ><><><><><><><><><><><><><><><><<<<
> Homepage:http://zhaozhendong.googlepages.com
> Mail: zhaozhendong@gmail.com
> >>>>>>><><><><><><><><<><>><><<<<<<
>

Re: [jira] Commented: (MAHOUT-227) Parallel SVM

Posted by zhao zhendong <zh...@gmail.com>.

{quote}
k = 1
Otherwise as in the Pegasos article.  No parallelism.
{quote}

I am confused. In that case, what is the motivation for integrating
Pegasos into Mahout?

Can you estimate in which situations this implementation could outperform
the original Pegasos? On large-scale data sets, or for some other reason?

With this implementation, how can we take advantage of the Map-reduce framework?


On Tue, Dec 22, 2009 at 12:44 PM, Ted Dunning (JIRA) <ji...@apache.org>wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793497#action_12793497]
>
> Ted Dunning commented on MAHOUT-227:
> ------------------------------------
>
> {quote}
> Can you specify this sequential implementation?
> {quote}
>
> k = 1
>
> Otherwise as in the Pegasos article.
>
>
> > Parallel SVM
> > ------------
> >
> >                 Key: MAHOUT-227
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
> >             Project: Mahout
> >          Issue Type: Task
> >          Components: Classification
> >            Reporter: zhao zhendong
> >         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
> >
> >
> > I wrote a proposal of parallel algorithm for SVM training. Any comment is
> welcome.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>



Re: [jira] Commented: (MAHOUT-227) Parallel SVM

Posted by zhao zhendong <zh...@gmail.com>.

Oops.

I directly attached the files to this issue just now. I think this is the
simplest way to share the documents with you guys.

Thanks.

On Mon, Dec 21, 2009 at 10:55 AM, Ted Dunning <te...@gmail.com> wrote:

> Putting preliminary documents onto the JIRA is fine.  Putting it on the wiki
> is fine as well.  The problem is that the patch that you posted didn't
> have anything in it.
>
> On Sun, Dec 20, 2009 at 6:12 PM, zhao zhendong <zhaozhendong@gmail.com
> >wrote:
>
> > Thanks.
> >
> > Ok, I will put the proposal on the wiki later today.
> >
> > Grant Ingersoll suggested that I share the proposal as a patch; I think
> > he may have meant the source code rather than documents.
> >
> >
> > On Mon, Dec 21, 2009 at 8:03 AM, David Hall (JIRA) <ji...@apache.org>
> > wrote:
> >
> > >
> > >    [
> > >
> >
> https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793053#action_12793053
> > ]
> > >
> > > David Hall commented on MAHOUT-227:
> > > -----------------------------------
> > >
> > > As Ted hints, a proposal should really be placed on the wiki.
> > > http://cwiki.apache.org/MAHOUT/
> > >
> > > Looking forward to it.
> > >
> > > > Parallel SVM
> > > > ------------
> > > >
> > > >                 Key: MAHOUT-227
> > > >                 URL:
> https://issues.apache.org/jira/browse/MAHOUT-227
> > > >             Project: Mahout
> > > >          Issue Type: Task
> > > >          Components: Classification
> > > >            Reporter: zhao zhendong
> > > >         Attachments: svmProposal.patch
> > > >
> > > >
> > > > I wrote a proposal of parallel algorithm for SVM training. Any
> comment
> > is
> > > welcome.
> > >
> > > --
> > > This message is automatically generated by JIRA.
> > > -
> > > You can reply to this email to add a comment to the issue online.
> > >
> > >
> >
> >
> > --
> > -------------------------------------------------------------
> >
> > Zhen-Dong Zhao (Maxim)
> >
> > <><<><><><><><><><>><><><><><>>>>>>
> >
> > Department of Computer Science
> > School of Computing
> > National University of Singapore
> >
> > ><><><><><><><><><><><><><><><><<<<
> > Homepage:http://zhaozhendong.googlepages.com
> > Mail: zhaozhendong@gmail.com
> > >>>>>>><><><><><><><><<><>><><<<<<<
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>




Re: [jira] Commented: (MAHOUT-227) Parallel SVM

Posted by Ted Dunning <te...@gmail.com>.

Putting preliminary documents onto the JIRA is fine.  Putting it on the wiki
is fine as well.  The problem is that the patch that you posted didn't
have anything in it.

On Sun, Dec 20, 2009 at 6:12 PM, zhao zhendong <zh...@gmail.com>wrote:

> Thanks.
>
> Ok, I will put the proposal on the wiki later today.
>
> Grant Ingersoll suggested that I share the proposal as a patch; I think
> he may have meant the source code rather than documents.
>
>
> On Mon, Dec 21, 2009 at 8:03 AM, David Hall (JIRA) <ji...@apache.org>
> wrote:
>
> >
> >    [
> >
> https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793053#action_12793053
> ]
> >
> > David Hall commented on MAHOUT-227:
> > -----------------------------------
> >
> > As Ted hints, a proposal should really be placed on the wiki.
> > http://cwiki.apache.org/MAHOUT/
> >
> > Looking forward to it.
> >
> > > Parallel SVM
> > > ------------
> > >
> > >                 Key: MAHOUT-227
> > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
> > >             Project: Mahout
> > >          Issue Type: Task
> > >          Components: Classification
> > >            Reporter: zhao zhendong
> > >         Attachments: svmProposal.patch
> > >
> > >
> > > I wrote a proposal of parallel algorithm for SVM training. Any comment
> is
> > welcome.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
>
>
> --
> -------------------------------------------------------------
>
> Zhen-Dong Zhao (Maxim)
>
> <><<><><><><><><><>><><><><><>>>>>>
>
> Department of Computer Science
> School of Computing
> National University of Singapore
>
> ><><><><><><><><><><><><><><><><<<<
> Homepage:http://zhaozhendong.googlepages.com
> Mail: zhaozhendong@gmail.com
> >>>>>>><><><><><><><><<><>><><<<<<<
>



-- 
Ted Dunning, CTO
DeepDyve

Re: [jira] Commented: (MAHOUT-227) Parallel SVM

Posted by zhao zhendong <zh...@gmail.com>.

Thanks.

Ok, I will put the proposal on the wiki later today.

Grant Ingersoll suggested that I share the proposal as a patch; I think
he may have meant the source code rather than documents.


On Mon, Dec 21, 2009 at 8:03 AM, David Hall (JIRA) <ji...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793053#action_12793053]
>
> David Hall commented on MAHOUT-227:
> -----------------------------------
>
> As Ted hints, a proposal should really be placed on the wiki.
> http://cwiki.apache.org/MAHOUT/
>
> Looking forward to it.
>
> > Parallel SVM
> > ------------
> >
> >                 Key: MAHOUT-227
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
> >             Project: Mahout
> >          Issue Type: Task
> >          Components: Classification
> >            Reporter: zhao zhendong
> >         Attachments: svmProposal.patch
> >
> >
> > I wrote a proposal of parallel algorithm for SVM training. Any comment is
> welcome.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>



[jira] Commented: (MAHOUT-227) Parallel SVM

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793021#action_12793021 ] 

Ted Dunning commented on MAHOUT-227:
------------------------------------


Also, you put the files deep into the source code tree.  They shouldn't be down there in the end, but will need to be put onto the wiki or into javadoc form.


> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>            Reporter: zhao zhendong
>         Attachments: svmProposal.patch
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Commented: (MAHOUT-227) Parallel SVM

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793487#action_12793487 ] 

Ted Dunning commented on MAHOUT-227:
------------------------------------

{quote}
I understand this concern. Actually, if we set the parameter k to 1,000,000
or higher, do you think it is reasonable to take advantage of the Map-reduce
framework? I mean, from a system implementation point of view.
{quote}

If you increase k to very large values, you will be able to get a bit more computation, but if you follow my small cluster example I think that increasing k from 1000 to 1,000,000 will likely increase efficiency from 0.1% to less than 50% and will drive the algorithm well beyond the region where kT is constant.  You will still have quite a lot of I/O per cycle which may prevent you from achieving even 10% efficiency.

For larger clusters, the problem will be much worse.

Go ahead and try it, though.  Your real results count for more than my estimates.

And as I said before, getting a good sequential implementation is of real value as well.
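
The trade-off described in this comment can be put into a rough back-of-the-envelope model. The sketch below is only an illustration and is not taken from the proposal or from Mahout: the function name, the per-example cost, and the fixed per-iteration overhead are assumed placeholder values, and I/O per cycle is ignored, so real efficiency would likely be lower.

```python
def utilization(k, cores, seconds_per_example, overhead_seconds):
    """Fraction of wall-clock time spent on useful computation per iteration.

    Hypothetical model: each iteration splits a batch of k examples evenly
    across the cores, then pays a fixed scheduling overhead (for example,
    one map-reduce invocation). I/O cost is not modeled.
    """
    compute_seconds = (k / cores) * seconds_per_example
    return compute_seconds / (compute_seconds + overhead_seconds)


if __name__ == "__main__":
    # Assumed placeholder constants, not measurements from any cluster.
    for k in (1_000, 1_000_000):
        u = utilization(k, cores=100, seconds_per_example=1e-7,
                        overhead_seconds=10.0)
        print(f"k={k}: utilization ~ {u:.2e}")
```

Under this model, utilization only approaches 1 when the per-iteration compute time is large compared with the fixed overhead, which is why raising k helps but cannot remove the cost of one job launch per iteration.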

> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>            Reporter: zhao zhendong
>         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Updated: (MAHOUT-227) Parallel SVM

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-227:
-------------------------------

    Status: Open  (was: Patch Available)

The patch doesn't contain any code.

> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>            Reporter: zhao zhendong
>         Attachments: svmProposal.patch
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Issue Comment Edited: (MAHOUT-227) Parallel SVM

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793497#action_12793497 ] 

Ted Dunning edited comment on MAHOUT-227 at 12/22/09 4:42 AM:
--------------------------------------------------------------

{quote}
Can you specify this sequential implementation?
{quote}

k = 1

Otherwise as in the Pegasos article.  No parallelism.


      was (Author: tdunning):
    {quote}
Can you specify this sequential implementation?
{quote}

k = 1

Otherwise as in the Pegasos article.

  
> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>            Reporter: zhao zhendong
>         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Commented: (MAHOUT-227) Parallel SVM

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793088#action_12793088 ] 

Ted Dunning commented on MAHOUT-227:
------------------------------------

Here are a few formatting suggestions:

a) when cutting and pasting from somebody else's work, it is good to point this out.  You should directly credit figure 3 and the algorithm pseudo-code which are cut-and-pasted directly from the original paper.

b) text in your diagram got resized and is now only partially readable.  This makes it a bit harder to follow exactly what you intend.


More importantly, the parameter k in the original paper is a batch size.  You propose to parallelize the computation of each batch, but otherwise leave the main structure of the computation in place.  If we assume a small cluster with, say, 100 cores (12 machines or so), then if you set k to 1000, each core will get to do about a dozen vector operations.  This is likely to be no more than a microsecond of computation per core per iteration.  My guess is that this will result in very, very poor CPU utilization since you will require one map-reduce invocation per iteration.  Concretely put, you will have about a millisecond of useful computation every 10 seconds or so.  

Your approach would probably work much better if applied to a single multi-core machine where the very high rendezvous rate would be more achievable.  I don't expect that this proposed approach will work with map-reduce.

On the other hand, Pegasos is a pretty scalable algorithm even on a single machine.  If you were able to produce a high quality sequential implementation, that would be a substantial contribution to Mahout.
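
For orientation, a sequential Pegasos (one example per step, no parallelism) might look roughly like the sketch below. It follows the update rule as commonly described for Pegasos (step size 1/(lambda*t), shrinking, a hinge-loss correction, and an optional projection step). The function names, default values, and the dense-list vector representation are assumptions made for illustration and are unrelated to Mahout's APIs.

```python
import math
import random


def pegasos_train(examples, lam=0.1, iterations=1000, seed=0):
    """Sequential Pegasos sketch for a linear SVM (batch size k = 1).

    examples: list of (x, y) pairs, x a list of floats, y in {-1, +1}.
    lam: regularization parameter (lambda).
    """
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    radius = 1.0 / math.sqrt(lam)
    for t in range(1, iterations + 1):
        x, y = rng.choice(examples)          # sample one training example
        eta = 1.0 / (lam * t)                # step size for iteration t
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        w = [(1.0 - eta * lam) * wi for wi in w]   # shrink (regularizer)
        if margin < 1.0:                     # hinge loss is active
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > radius:                    # optional projection step
            w = [wi * radius / norm for wi in w]
    return w


def predict(w, x):
    """Return +1 or -1 according to the sign of the dot product w . x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0.0 else -1
```

Each step touches a single example, so the cost per iteration depends only on the number of non-zero features, not on the size of the training set.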


> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>            Reporter: zhao zhendong
>         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Commented: (MAHOUT-227) Parallel SVM

Posted by "David Hall (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793053#action_12793053 ] 

David Hall commented on MAHOUT-227:
-----------------------------------

As Ted hints, a proposal should really be placed on the wiki. http://cwiki.apache.org/MAHOUT/

Looking forward to it.

> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>            Reporter: zhao zhendong
>         Attachments: svmProposal.patch
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Updated: (MAHOUT-227) Parallel SVM

Posted by "zhao zhendong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhao zhendong updated MAHOUT-227:
---------------------------------

    Attachment:     (was: svmProposal.patch)

> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>            Reporter: zhao zhendong
>         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Commented: (MAHOUT-227) Parallel SVM

Posted by "zhao zhendong (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793111#action_12793111 ] 

zhao zhendong commented on MAHOUT-227:
--------------------------------------

Thanks for your comments.


Sure, actually, I have pointed out before "the framework of this
implementation is:"


Thanks, I will revise it later.


I understand this concern. Actually, if we set the parameter k to 1,000,000
or higher, do you think it is reasonable to take advantage of the Map-reduce
framework? I mean, from a system implementation point of view.


Are there any other ideas about how to extend this algorithm to a
parallel version?


Yes, that is why we are discussing this issue. But I really need some comments
and help, since I am still a newbie here.





> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>            Reporter: zhao zhendong
>         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Commented: (MAHOUT-227) Parallel SVM

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793497#action_12793497 ] 

Ted Dunning commented on MAHOUT-227:
------------------------------------

{quote}
Can you specify this sequential implementation?
{quote}

k = 1

Otherwise as in the Pegasos article.


> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>            Reporter: zhao zhendong
>         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Updated: (MAHOUT-227) Parallel SVM

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-227:
-----------------------------

    Fix Version/s:     (was: 0.3)
                   0.4

Moving to 0.4 per Zhao's comment

> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: zhao zhendong
>             Fix For: 0.4
>
>         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Updated: (MAHOUT-227) Parallel SVM

Posted by "zhao zhendong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhao zhendong updated MAHOUT-227:
---------------------------------

    Status: Patch Available  (was: Open)

The patch is a document, namely the proposal for a parallel algorithm for SVM training.

> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>            Reporter: zhao zhendong
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Commented: (MAHOUT-227) Parallel SVM

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831648#action_12831648 ] 

Ted Dunning commented on MAHOUT-227:
------------------------------------

Is this going to be complete this week or next?

If not, we should push it to 0.4.

(and given the current state, I would guess that there is no other option)

> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: zhao zhendong
>             Fix For: 0.3
>
>         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Commented: (MAHOUT-227) Parallel SVM

Posted by "zhao zhendong (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831837#action_12831837 ] 

zhao zhendong commented on MAHOUT-227:
--------------------------------------

So far I have not worked on the parallel binary classification, so it
cannot go into 0.3. As we discussed, I think parallel multi-class
classification can exploit parallelism more easily, because it can be
decomposed into a set of independent binary classifiers. Does that make sense?
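
To illustrate the decomposition mentioned above, here is a minimal sketch in Python. It is not code from the proposal: it assumes a one-vs-rest scheme and a simple Pegasos-style SGD trainer, and the function names (train_pegasos, train_one_vs_rest, predict) are hypothetical. The point is only that each per-class binary problem is trained without reference to the others, so each could be handed to a separate worker.

```python
import random


def train_pegasos(examples, lam=0.01, epochs=20, seed=0):
    """Train a linear binary SVM with a Pegasos-style SGD update.

    examples: list of (features, label) pairs, label in {-1, +1}.
    Assumption: examples are visited in shuffled epochs rather than
    sampled with replacement, a common variant of the original scheme.
    """
    rng = random.Random(seed)
    dim = len(examples[0][0])
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        order = list(range(len(examples)))
        rng.shuffle(order)
        for i in order:
            x, y = examples[i]
            t += 1
            eta = 1.0 / (lam * t)  # step size 1 / (lambda * t)
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            scale = 1.0 - 1.0 / t  # equals 1 - eta * lam
            w = [scale * wj for wj in w]
            if margin < 1.0:
                # Hinge loss is active: move toward the example.
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


def train_one_vs_rest(data, classes, lam=0.01, epochs=20, seed=0):
    """Train one binary model per class (hypothetical helper).

    Each loop iteration depends only on its own relabelled copy of the
    data, so the iterations could run in parallel.
    """
    models = {}
    for c in classes:
        binary = [(x, 1 if label == c else -1) for x, label in data]
        models[c] = train_pegasos(binary, lam=lam, epochs=epochs, seed=seed)
    return models


def predict(models, x):
    """Return the class whose binary model gives the highest score."""
    def score(c):
        return sum(wj * xj for wj, xj in zip(models[c], x))
    return max(models, key=score)
```

Whether this gives a useful speedup depends on the number of classes and on how expensive it is for every worker to read the full data set.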

I may try this later, but I do not know whether this method can
achieve even a small improvement.

Cheers,
Zhendong

-- 
Zhen-Dong Zhao (Maxim)
Department of Computer Science
School of Computing
National University of Singapore


> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: zhao zhendong
>             Fix For: 0.3
>
>         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Updated: (MAHOUT-227) Parallel SVM

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-227:
-----------------------------

        Fix Version/s: 0.3
    Affects Version/s: 0.2

> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: zhao zhendong
>             Fix For: 0.3
>
>         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Updated: (MAHOUT-227) Parallel SVM

Posted by "zhao zhendong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhao zhendong updated MAHOUT-227:
---------------------------------

    Attachment: svmProposal.patch

The patch includes two files with the same content (.doc and .pdf).

> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>            Reporter: zhao zhendong
>         Attachments: svmProposal.patch
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Updated: (MAHOUT-227) Parallel SVM

Posted by "zhao zhendong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhao zhendong updated MAHOUT-227:
---------------------------------

    Attachment: ParallelPegasos.pdf
                ParallelPegasos.doc

These are two distinct files with the same content. They contain the proposal for Parallel Pegasos; Pegasos is one of the best-known SVM solvers.

> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>            Reporter: zhao zhendong
>         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf, svmProposal.patch
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Commented: (MAHOUT-227) Parallel SVM

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793019#action_12793019 ] 

Ted Dunning commented on MAHOUT-227:
------------------------------------


Actually, the patch does not include any text files.

Can you attach the files directly rather than trying to create a patch?  Patches are intended more for proposed code.

 

> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>            Reporter: zhao zhendong
>         Attachments: svmProposal.patch
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.


[jira] Commented: (MAHOUT-227) Parallel SVM

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832089#action_12832089 ] 

Ted Dunning commented on MAHOUT-227:
------------------------------------


Zhao,

My thought is that having a good sequential SVM that learns very fast would be almost as scalable as a parallel implementation, especially if it is right next to a good SGD logistic regression implementation.

My guess is that speedup by randomized variable subsets is likely to be the most effective strategy if we absolutely need to have speedup.  It is also possible that just speeding up the parameter sweeps that are normal practice for any serious data mining would be just about as useful as making learning fast for a single parameter setting.  That would require giving different mappers different parameter settings and having each of them read the entire data set.  Each mapper should probably run multiple settings at once so that the data is re-used relatively efficiently.
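
The parameter-sweep idea above can be sketched roughly as follows. This is an illustration only, not Mahout code: it assumes a Pegasos-style linear SVM update, treats the regularization constant lambda as the swept parameter, and uses hypothetical names (assign_settings, sweep_in_one_pass). Each simulated mapper receives several settings and updates one weight vector per setting as every example streams past, so the data is read once per pass regardless of how many settings the mapper holds.

```python
def assign_settings(lambdas, num_mappers):
    """Split the parameter settings round-robin across mappers."""
    return [lambdas[i::num_mappers] for i in range(num_mappers)]


def sweep_in_one_pass(stream, lambdas, dim, passes=1):
    """Train one linear SVM per lambda while sharing each data read.

    stream: zero-argument callable returning an iterator of
            (features, label) pairs with label in {-1, +1}.
    Returns a dict mapping each lambda to its weight vector.
    """
    weights = {lam: [0.0] * dim for lam in lambdas}
    t = 0
    for _ in range(passes):
        for x, y in stream():
            t += 1
            # Every setting held by this mapper reuses the same example.
            for lam, w in weights.items():
                eta = 1.0 / (lam * t)  # Pegasos step size
                margin = y * sum(wj * xj for wj, xj in zip(w, x))
                scale = 1.0 - 1.0 / t  # equals 1 - eta * lam
                for j in range(dim):
                    w[j] *= scale
                    if margin < 1.0:
                        w[j] += eta * y * x[j]
    return weights


def run_sweep(stream, lambdas, dim, num_mappers, passes=1):
    """Simulate the mappers sequentially and merge their results."""
    results = {}
    for group in assign_settings(lambdas, num_mappers):
        if group:  # a mapper may receive no settings
            results.update(sweep_in_one_pass(stream, group, dim, passes))
    return results
```

In a real job each group would run in its own map task over the whole data set; the sequential loop in run_sweep only stands in for that. Choosing among the trained models (for example by held-out error) is left out of the sketch.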
 


> Parallel SVM
> ------------
>
>                 Key: MAHOUT-227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-227
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: zhao zhendong
>             Fix For: 0.4
>
>         Attachments: ParallelPegasos.doc, ParallelPegasos.pdf
>
>
> I wrote a proposal of parallel algorithm for SVM training. Any comment is welcome.
