Posted to dev@mahout.apache.org by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org> on 2015/06/08 22:59:00 UTC

[jira] [Updated] (MAHOUT-1722) DRM row sampling api

     [ https://issues.apache.org/jira/browse/MAHOUT-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-1722:
-------------------------------------
    Description: 
We will ask engines to support two small APIs for row vector sampling.

One API is uniform multivariate hypergeometric sampling (the k parameter is given); the other samples by fraction (a simple map-only probabilistic filter). A Spark implementation is enclosed (Spark simply has an API for both, although its k-sampler does not strictly guarantee the mathematical distribution and is suitable only for small k).
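
To make the two sampling modes concrete, here is a minimal single-machine sketch in Python. It is not the code from the PR and not a Mahout or Spark API: the function names `sample_by_fraction` and `sample_k` are hypothetical, and it assumes that "k parameter is given" means drawing exactly k rows uniformly without replacement.

```python
import random


def sample_by_fraction(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    This is a map-only filter: each row is decided on its own,
    so the size of the result is random, not fixed.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def sample_k(rows, k, seed=None):
    """Draw exactly min(k, len(rows)) rows uniformly without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))
```

The fraction sampler needs no coordination between rows, which is why it maps cleanly onto a distributed filter; the k sampler needs a global view of the row count, which is presumably why it is the harder of the two for an engine to support exactly.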

The challenge here is that the returned rows should be ordinally renumbered.
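
One possible way to renumber sampled rows across partitions is sketched below in Python. This is an illustration under assumptions, not the approach taken in the PR: `renumber_rows` is a hypothetical helper, partitions are modeled as plain lists of (old_key, row) pairs, and row order within and across partitions is assumed to be the order to preserve.

```python
from itertools import accumulate


def renumber_rows(partitions):
    """Reassign row keys to 0..n-1 across partitions, preserving order.

    `partitions` is a list of lists of (old_key, row) pairs.
    """
    # Pass 1: count the surviving rows in each partition.
    counts = [len(part) for part in partitions]
    # Exclusive prefix sum gives the first new key of each partition.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Pass 2: each partition renumbers its rows locally from its offset.
    return [
        [(offset + i, row) for i, (_, row) in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]
```

For example, `renumber_rows([[(3, "a"), (7, "b")], [], [(12, "c")]])` yields `[[(0, "a"), (1, "b")], [], [(2, "c")]]`. Only the per-partition counts have to be gathered centrally; the renumbering itself stays local to each partition.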

(I may need to revisit this issue later; this was a pretty hasty API change and might be less than ideal in the general case.)

PR https://github.com/apache/mahout/pull/135

  was:
We will ask engines to support two tiny apis for row vector sampling. 

One api is uniform multivariate hypergeometric (k parameter is given), and another is by fraction (simple map-only probabilistic filter). Spark implementation is enclosed (Spark just has an api for both, albeit k-sampler does not have strict mathematical guarantee of the distribution, and is only for small k).

challenge here is that returned rows should be ordinally renumbered.

(maybe i need to revisit this issue later, this was a pretty hasty API change, might be less than ideal in general case).


> DRM row sampling api
> --------------------
>
>                 Key: MAHOUT-1722
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1722
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.10.2
>
>
> We will ask engines to support two small APIs for row vector sampling.
> One API is uniform multivariate hypergeometric sampling (the k parameter is given); the other samples by fraction (a simple map-only probabilistic filter). A Spark implementation is enclosed (Spark simply has an API for both, although its k-sampler does not strictly guarantee the mathematical distribution and is suitable only for small k).
> The challenge here is that the returned rows should be ordinally renumbered.
> (I may need to revisit this issue later; this was a pretty hasty API change and might be less than ideal in the general case.)
> PR https://github.com/apache/mahout/pull/135


