Posted to dev@mahout.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2014/07/07 21:32:36 UTC

[jira] [Commented] (MAHOUT-1583) cbind() operator for Scala DRMs

    [ https://issues.apache.org/jira/browse/MAHOUT-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054084#comment-14054084 ] 

ASF GitHub Bot commented on MAHOUT-1583:
----------------------------------------

Github user dlyubimov commented on a diff in the pull request:

    https://github.com/apache/mahout/pull/20#discussion_r14617005
  
    --- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/blas/CbindAB.scala ---
    @@ -0,0 +1,95 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.mahout.sparkbindings.blas
    +
    +import org.apache.log4j.Logger
    +import scala.reflect.ClassTag
    +import org.apache.mahout.sparkbindings.drm.DrmRddInput
    +import org.apache.mahout.math._
    +import scalabindings._
    +import RLikeOps._
    +import org.apache.mahout.math.drm.logical.OpCbind
    +import org.apache.spark.SparkContext._
    +
    +/** Physical cbind */
    +object CbindAB {
    +
    +  private val log = Logger.getLogger(CbindAB.getClass)
    +
    +  def cbindAB_nograph[K: ClassTag](op: OpCbind[K], srcA: DrmRddInput[K], srcB: DrmRddInput[K]): DrmRddInput[K] = {
    +
    +    val a = srcA.toDrmRdd()
    +    val b = srcB.toDrmRdd()
    +    val n = op.ncol
    +    val n1 = op.A.ncol
    +    val n2 = n - n1
    +
    +    // Check if A and B are identically partitioned AND keyed. if they are, then just perform zip
    +    // instead of join, and apply the op map-side. Otherwise, perform join and apply the op
    +    // reduce-side.
    +    val rdd = if (op.isIdenticallyPartitioned(op.A)) {
    +
    +      log.debug("applying zipped cbind()")
    +
    +      a
    +          .zip(b)
    +          .map {
    +        case ((keyA, vectorA), (keyB, vectorB)) =>
    +          assert(keyA == keyB, "inputs are claimed identically partitioned, but they are not identically keyed")
    +
    +          val dense = vectorA.isDense && vectorB.isDense
    +          val vec: Vector = if (dense) new DenseVector(n) else new SequentialAccessSparseVector(n)
    --- End diff --
    
    Not sure why you are saying this. Unless I am misinterpreting you, neither theoretical nor practical estimates support this. I added an assignment benchmark with and without an openhash intermediary. Here are the results of running it:
    
        Testing started at 12:26 PM ...
        Average assignment seqSparse2seqSparse time: 29.673 ms
        Average assignment seqSparse2seqSparse via Random Access Sparse time: 406.510 ms
    
    This of course assumes that, for the most part, the payload consists of SeqSparse vectors rather than RandomAccessSparse ones (which is always the case, unless somebody explicitly messes it up with a `mapBlock`).
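
To make the two assignment strategies compared above concrete, here is a minimal sketch in plain Scala. It does not use Mahout's vector classes and does not reproduce the benchmark: `SeqSparse`, `CbindSketch`, `cbindDirect` and `cbindViaHash` are hypothetical names introduced only for illustration. The assumption is that a sequential-access sparse vector can be modelled as parallel arrays of strictly increasing indices and their values.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a sequential-access sparse vector (not a Mahout class):
// idx holds strictly increasing element indices, vals holds the matching values.
final case class SeqSparse(size: Int, idx: Array[Int], vals: Array[Double])

object CbindSketch {

  // Direct path: shift B's indices by A's width and append them.
  // Both inputs are already sorted, so the result stays sorted with no extra work.
  def cbindDirect(a: SeqSparse, b: SeqSparse): SeqSparse =
    SeqSparse(a.size + b.size, a.idx ++ b.idx.map(_ + a.size), a.vals ++ b.vals)

  // Intermediary path: insert everything into a hash map first,
  // then sort by index again to get back to a sequential layout.
  def cbindViaHash(a: SeqSparse, b: SeqSparse): SeqSparse = {
    val m = mutable.HashMap.empty[Int, Double]
    for (i <- a.idx.indices) m(a.idx(i)) = a.vals(i)
    for (i <- b.idx.indices) m(b.idx(i) + a.size) = b.vals(i)
    val sorted = m.toArray.sortBy(_._1)
    SeqSparse(a.size + b.size, sorted.map(_._1), sorted.map(_._2))
  }
}
```

Both functions return the same vector. The second one pays for hashing every element and for re-sorting, which is the kind of overhead the benchmark above is meant to measure for the real vector classes.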
       


> cbind() operator for Scala DRMs
> -------------------------------
>
>                 Key: MAHOUT-1583
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1583
>             Project: Mahout
>          Issue Type: Task
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 1.0
>
>
> Another R-like operator, cbind (stitching two matrices together). Seems to come up now and then. 
> Just like elementwise operations, and perhaps some others, it will have two physical implementation paths: a zip for identically distributed operands, and a full join when they are not.
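
The two physical paths described in the issue can be sketched with plain Scala collections. This is only an illustration under stated assumptions, not the Spark implementation from the pull request: `CbindPaths`, `cbindZip` and `cbindJoin` are hypothetical names, rows are modelled as `(key, Seq[Double])` pairs, and the join path assumes that a row missing on one side is treated as all zeros.

```scala
object CbindPaths {

  // A row is a key plus its dense values (a stand-in for a DRM row).
  type Row = (Int, Seq[Double])

  // Zip path: assumes both inputs hold the same keys in the same order,
  // so rows can be paired positionally without any lookup.
  def cbindZip(a: Seq[Row], b: Seq[Row]): Seq[Row] =
    a.zip(b).map { case ((ka, va), (kb, vb)) =>
      require(ka == kb, "inputs are not identically keyed")
      (ka, va ++ vb)
    }

  // Join path: match rows by key. n1 and n2 are the column counts of A and B;
  // a row missing on one side is padded with zeros (an assumption of this sketch).
  def cbindJoin(a: Seq[Row], b: Seq[Row], n1: Int, n2: Int): Seq[Row] = {
    val ma = a.toMap
    val mb = b.toMap
    (ma.keySet ++ mb.keySet).toSeq.sorted.map { k =>
      (k, ma.getOrElse(k, Seq.fill(n1)(0.0)) ++ mb.getOrElse(k, Seq.fill(n2)(0.0)))
    }
  }
}
```

When the inputs really are identically ordered and keyed, both functions give the same result. The join path also works when they are not, at the cost of a key lookup per row.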


