Posted to issues@flink.apache.org by "Thang Nguyen (JIRA)" <ji...@apache.org> on 2016/02/05 03:15:40 UTC

[jira] [Comment Edited] (FLINK-1733) Add PCA to machine learning library

    [ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133521#comment-15133521 ] 

Thang Nguyen edited comment on FLINK-1733 at 2/5/16 2:14 AM:
-------------------------------------------------------------

{quote}
I'm not sure that {{DenseMatrix}} fits for sPCA
{quote}

What if the matrix is relatively small?

From the paper, where d is the number of principal components:
{quote}
matrix C, which is of size D × d (recall that d is typically small). For example, in our experiments with a 94 GB dataset, the size of matrix C was 30 MB, which can easily fit in memory.
{quote}
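
To put a rough number on that: a dense D × d matrix of doubles takes about D * d * 8 bytes, so even with (hypothetically) D = 1,000,000 features and d = 10 components, C is only on the order of 80 MB and can still be held locally on every worker.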

This matrix C is broadcast to the workers and is used to redundantly recompute an intermediate matrix (trading extra local computation for lower communication cost). The distributed algorithm also only needs to access a single row at a time to compute a partial result, and then sums the partials at the end.
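
To make that access pattern concrete, here is a minimal sketch of how it could look on Flink's DataSet API: the small matrix C is attached as a broadcast set, each task combines it with one input row at a time, and the per-row partials are summed in a reduce. The names ({{rowsOfY}}, {{smallC}}) and the per-row computation are placeholders for illustration, not the actual sPCA update:

{code:scala}
import org.apache.flink.api.scala._
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

// Sketch: broadcast a small D x d matrix C, compute a length-d partial
// result per input row, and sum the partials. Placeholder computation only.
def partialSums(rowsOfY: DataSet[Array[Double]],
                smallC: DataSet[Array[Array[Double]]]): DataSet[Array[Double]] = {
  rowsOfY
    .map(new RichMapFunction[Array[Double], Array[Double]] {
      private var c: Array[Array[Double]] = _

      override def open(parameters: Configuration): Unit = {
        // The broadcast set holds a single element: the small D x d matrix C.
        c = getRuntimeContext
          .getBroadcastVariable[Array[Array[Double]]]("smallC").get(0)
      }

      // Placeholder per-row computation: y_i^T * C, a length-d partial result.
      override def map(row: Array[Double]): Array[Double] = {
        val d = c(0).length
        val partial = Array.ofDim[Double](d)
        var i = 0
        while (i < row.length) {
          var j = 0
          while (j < d) {
            partial(j) += row(i) * c(i)(j)
            j += 1
          }
          i += 1
        }
        partial
      }
    })
    .withBroadcastSet(smallC, "smallC")
    // Sum the per-row partials into a single length-d vector.
    .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
}
{code}

The point is only that nothing in this pattern requires a distributed matrix type: C stays local to each task, and only rows and small partial vectors flow through the DataSet.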

Is the lack of a distributed matrix/vector implementation enough of a blocker to worry about, or should I continue?


> Add PCA to machine learning library
> -----------------------------------
>
>                 Key: FLINK-1733
>                 URL: https://issues.apache.org/jira/browse/FLINK-1733
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Thang Nguyen
>            Priority: Minor
>              Labels: ML
>
> Dimension reduction is a crucial prerequisite for many data analysis tasks. Therefore, Flink's machine learning library should contain a principal component analysis (PCA) implementation. Maria-Florina Balcan et al. [1] propose a distributed PCA algorithm. A more recent publication [2] describes another scalable PCA implementation.
> Resources:
> [1] [http://arxiv.org/pdf/1408.5823v5.pdf]
> [2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)