Posted to dev@mahout.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/04/21 03:44:25 UTC
[jira] [Commented] (MAHOUT-1833) Enhance svec function to accept cardinality as parameter
[ https://issues.apache.org/jira/browse/MAHOUT-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251090#comment-15251090 ]
ASF GitHub Bot commented on MAHOUT-1833:
----------------------------------------
Github user resec commented on the pull request:
https://github.com/apache/mahout/pull/224#issuecomment-212690207
@dlyubimov, thanks for the great detailed explanation.
I don't think I have privileges to edit the [In-Core Reference Page](http://mahout.apache.org/users/environment/in-core-reference.html), so somebody may need to help with that.
As for the authoritative doc on the other branch, I can help with that and will submit another PR accordingly soon.
> Enhance svec function to accept cardinality as parameter
> ---------------------------------------------------------
>
> Key: MAHOUT-1833
> URL: https://issues.apache.org/jira/browse/MAHOUT-1833
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.12.0
> Environment: Mahout Spark Shell 0.12.0,
> Spark 1.6.0 Cluster on Hadoop Yarn 2.7.1,
> Centos7 64bit
> Reporter: Edmond Luo
>
> It will be nice to enhance the existing svec function in org.apache.mahout.math.scalabindings
> {code}
> /**
>  * Create a sparse vector out of a list of (index, value) tuples.
>  * @param sdata (index, value) pairs
>  * @param cardinality the vector's cardinality; if negative, it is inferred as max index + 1
>  * @return a RandomAccessSparseVector of the given cardinality
>  */
> def svec(sdata: TraversableOnce[(Int, AnyVal)], cardinality: Int = -1) = {
>   // A TraversableOnce may only be traversed once, so materialize it first
>   val data = sdata.toSeq
>   val required = if (data.nonEmpty) data.map(_._1).max + 1 else 0
>   val resolved = if (cardinality < 0) {
>     required
>   } else if (cardinality < required) {
>     throw new IllegalArgumentException(
>       s"Required cardinality $required but got $cardinality")
>   } else {
>     cardinality
>   }
>   val initialCapacity = data.size
>   val sv = new RandomAccessSparseVector(resolved, initialCapacity)
>   data.foreach(t => sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
>   sv
> }
> {code}
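The proposed cardinality logic can be exercised without Mahout on the classpath. The following standalone sketch uses a hypothetical Map-backed stand-in for RandomAccessSparseVector and reproduces only the cardinality resolution and check from the quoted patch:

```scala
// Standalone sketch of the proposed svec cardinality check (assumption:
// Mahout is not available here, so a plain Map stands in for
// RandomAccessSparseVector; the real function returns a Mahout Vector).
object SvecSketch {
  // Resolve the effective cardinality: infer it from the data when negative,
  // reject it when smaller than the largest index + 1.
  def resolveCardinality(sdata: Seq[(Int, Double)], cardinality: Int = -1): Int = {
    val required = if (sdata.nonEmpty) sdata.map(_._1).max + 1 else 0
    if (cardinality < 0) required
    else if (cardinality < required)
      throw new IllegalArgumentException(
        s"Required cardinality $required but got $cardinality")
    else cardinality
  }

  // Map-backed "sparse vector": (cardinality, non-zero cells).
  def svec(sdata: Seq[(Int, Double)], cardinality: Int = -1): (Int, Map[Int, Double]) =
    (resolveCardinality(sdata, cardinality), sdata.toMap)
}
```

With this sketch, `svec(Seq((0, 1.0), (5, 2.0)))` infers cardinality 6, while passing an explicit cardinality of 20 keeps it at 20; a cardinality smaller than the largest index + 1 throws, matching the behavior the patch proposes.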
> So the user can specify the cardinality of the created sparse vector.
> This is very useful and convenient when creating a DRM from many sparse vectors whose actual (populated) sizes differ but whose logical size is the same, e.g. the rows of a sparse matrix.
> The code below demonstrates the case:
> {code}
> val cardinality = 20
> val rdd = sc.textFile("/some/file.txt")
>   .map(_.split(","))
>   .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
>   .reduceByKey((v1, v2) => v1 ++ v2)
>   .map(row => (row._1, svec(row._2, cardinality)))
> val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))
> // All element-wise operations below will fail if the DRM's sparse vectors
> // do not have a consistent cardinality
> val drm2 = drm + drm.t
> val drm3 = drm - drm.t
> val drm4 = drm * drm.t
> val drm5 = drm / drm.t
> {code}
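The grouping step above can be sketched in plain Scala without a Spark cluster. The names below are hypothetical stand-ins: rows become (rowIndex, Map[column -> value]) pairs instead of Mahout vectors, `groupBy` plays the role of `reduceByKey`, and a `require` enforces the shared cardinality the issue asks for:

```scala
// Plain-Scala sketch of building cardinality-consistent sparse rows from
// "row,col" text lines (hypothetical stand-in for the Spark/Mahout pipeline).
object ConsistentRows {
  val cardinality = 20 // shared logical row size, as in the example above

  def buildRows(lines: Seq[String]): Map[Int, Map[Int, Double]] =
    lines.map(_.split(","))
      .map(f => (f(0).toInt, (f(1).toInt, 1.0)))
      .groupBy(_._1) // like reduceByKey: collect all cells of a row
      .map { case (row, entries) =>
        val cells = entries.map(_._2).toMap
        // Every row shares one logical size, so element-wise ops stay valid
        require(cells.keys.max < cardinality,
          s"column index exceeds cardinality $cardinality")
        row -> cells
      }
}
```

For input lines `"0,1"`, `"0,3"`, `"1,2"` this yields row 0 with cells at columns 1 and 3, and row 1 with a cell at column 2, all under the same logical cardinality of 20.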
> Notice that in the last map, svec accepts an additional cardinality parameter, so the created sparse vectors all have a consistent cardinality.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)