You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@systemml.apache.org by "Matthias Boehm (JIRA)" <ji...@apache.org> on 2016/09/27 05:46:20 UTC
[jira] [Comment Edited] (SYSTEMML-831) Implement t-SNE algorithm

    [ https://issues.apache.org/jira/browse/SYSTEMML-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15525147#comment-15525147 ] 

Matthias Boehm edited comment on SYSTEMML-831 at 9/27/16 5:45 AM:
------------------------------------------------------------------

ok, I had a closer look and here is the high-level summary: the problematic expression is indeed {{X %*% t(X)}} (in function {{distance_matrix}}). Given the 60Kx784 input, this output matrix is 60Kx60K, i.e., 29GB in dense. The size itself is not problematic, but because the input is tiny, we only have 2 input partitions (and hence 2 output partitions), which gives us 15GB per partition which exceeds the 2GB limitation of partitions (on block manager or shuffle).

I now added a new robustness features to our mapmm (broadcast-based matrix multiply, which is often the source of these outer-product-like matrices). We now automatically repartition the input if the output size per partition is expected to be large. Running in -exec hybrid_spark and with 20GB driver and 55GB executors, this already solved the issue. The patch will be available in master tomorrow. However, note that a single matrix block (of blocksize 1K) is the smallest granularity for repartitioning - in this scenario a single block already outputs almost 500MB. Hence, even this approach is somewhat limited. As a workaround you can further reconfigure the block size in SystemML-config.xml to increase the potential degree of parallelism. But even then, the problem remains O(n^2), which will inevitable run into issues on large input data.

Furthermore, the script also exhibits further optimization opportunities. For example, the entire inner for loop over the number of rows can be vectorized, i.e. conditional assignments can be rewritten to something like {{betamin = (Hdiff>0)*beta + (Hdiff<=0)*betamin}}. Let me know if you need help with that.


was (Author: mboehm7):
ok, I had a closer look and here is the high-level summary: the problematic expression is indeed {{X %*% t(X)}} (in function {{distance_matrix}}). Given the 60Kx784 input, this output matrix is 60Kx60K, i.e., 29GB in dense. The size itself is not problematic, but because the input is tiny, we only have 2 input partitions (and hence 2 output partitions), which gives us 15GB per partition which exceeds the 2GB limitation of partitions (on block manager or shuffle).

I now added a new robustness features to our mapmm (broadcast-based matrix multiply, which is often the source of these outer-product-like matrices). We now automatically repartition the input if the output size per partition is expected to be large. Running in -exec hybrid_spark and with 20GB driver and 55GB executors, this already solved the issue. The patch will be available in master tomorrow. However, note that a single matrix block (of blocksize 1K) is the smallest granularity for repartitioning - in this scenario a single block already outputs almost 500MB. Hence, even this approach is somewhat limited. As a workaround you can further reconfigure the block size in SystemML-config.xml to increase the potential degree of parallelism. But even then, the problem remains O(n^2), which will inevitable run into issues on large input data.

Furthermore, the script also exhibits further optimization opportunities. For example, the entire inner for loop over the number of rows can be vectorized, i.e. conditional assignments can be rewritten to something like {{ betamin = (Hdiff>0)*beta + (Hdiff<=0)*betamin}}. Let me know if you need help with that.

> Implement t-SNE algorithm
> -------------------------
>
>                 Key: SYSTEMML-831
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-831
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms
>            Reporter: Imran Younus
>            Assignee: Imran Younus
>         Attachments: out_2016_09_26_10.log
>
>
> This jira implements the t-distributed Stochastic Neighbor Embedding algorithm for dimensionality reduction presented in this paper:
> Visualizing Data using t-SNE 
> by Laurens van der Maaten, Geoffrey Hinton
> http://www.jmlr.org/papers/v9/vandermaaten08a.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)