Posted to issues@flink.apache.org by "Thang Nguyen (JIRA)" <ji...@apache.org> on 2016/02/05 02:07:39 UTC

[jira] [Commented] (FLINK-1733) Add PCA to machine learning library

    [ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133447#comment-15133447 ] 

Thang Nguyen commented on FLINK-1733:
-------------------------------------

Thanks [~till.rohrmann], I've been flipping through the Odersky book and it is indeed an excellent resource. 

I have some questions that may be obvious, but their answers seem to elude me for whatever reason...

For context, I have read the sPCA paper a few times and have the Spark implementation of sPCA running locally with a remote debugger hooked up to validate my incremental work.

- Is it fine to use {{breeze.linalg.DenseMatrix}} for this sPCA implementation? As far as I can tell, matrix multiplication isn't implemented for {{flink.ml.math.DenseMatrix}}. (A small sketch of the kind of local operations I mean follows after this list.)

- How are DataSets partitioned across nodes when no key is explicitly specified? Are they distributed evenly based on the size of the DataSet?

- How does parallel execution on an arbitrarily large DataSet happen from a code perspective? Does the optimizer take care of most of the heavy lifting as long as the code is written in a functional manner? (I'm asking specifically about the FNormJob/YtXJob from the paper; a rough DataSet sketch of the FNorm case also follows below.) I am aware of the plan visualizer, but I haven't gotten to that point just yet...
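
For the first question, this is roughly what I mean by using Breeze locally inside the per-partition computations. The dense products are plain Breeze; the {{org.apache.flink.ml.math.Breeze}} conversion helpers mentioned in the comments are my assumption about the codebase, not something I've verified:

{code:scala}
import breeze.linalg.{DenseMatrix, DenseVector}

// Dense matrix-matrix and matrix-vector products work out of the box in Breeze.
val a  = DenseMatrix((1.0, 2.0), (3.0, 4.0))  // 2 x 2
val b  = DenseMatrix((5.0, 6.0), (7.0, 8.0))  // 2 x 2
val ab = a * b                                // matrix-matrix product
val v  = DenseVector(1.0, 1.0)
val av = a * v                                // matrix-vector product

// Assumption: if the conversion implicits in org.apache.flink.ml.math.Breeze
// (asBreeze / fromBreeze) are available, the results could be moved back into
// flink.ml.math types at the DataSet boundaries, e.g.:
//   import org.apache.flink.ml.math.Breeze._
//   val flinkVec  = org.apache.flink.ml.math.DenseVector(1.0, 2.0)
//   val breezeVec = flinkVec.asBreeze
{code}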
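
And for the third question, this is roughly how I picture something like FNormJob expressed with the DataSet API, relying on the operators (and the optimizer/runtime behind them) for parallelism rather than any explicit partitioning code. The method name and the row representation are just placeholders for illustration:

{code:scala}
import org.apache.flink.api.scala._

// Squared Frobenius norm of a matrix Y stored as a DataSet of rows:
// sum over all entries of y^2, i.e. a per-row map followed by a global reduce.
def squaredFrobeniusNorm(rows: DataSet[Array[Double]]): DataSet[Double] =
  rows
    .map(row => row.map(x => x * x).sum)  // per-row sum of squares, runs in parallel
    .reduce(_ + _)                        // combined into a single global sum

// Usage sketch:
// val env  = ExecutionEnvironment.getExecutionEnvironment
// val rows = env.fromElements(Array(1.0, 2.0), Array(3.0, 4.0))
// squaredFrobeniusNorm(rows).print()     // 1 + 4 + 9 + 16 = 30
{code}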


> Add PCA to machine learning library
> -----------------------------------
>
>                 Key: FLINK-1733
>                 URL: https://issues.apache.org/jira/browse/FLINK-1733
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Thang Nguyen
>            Priority: Minor
>              Labels: ML
>
> Dimension reduction is a crucial prerequisite for many data analysis tasks. Therefore, Flink's machine learning library should contain a principal components analysis (PCA) implementation. Maria-Florina Balcan et al. [1] propose a distributed PCA. A more recent publication [2] describes another scalable PCA implementation.
> Resources:
> [1] [http://arxiv.org/pdf/1408.5823v5.pdf]
> [2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)