Posted to user@spark.apache.org by ameyamm <am...@outlook.com> on 2015/07/10 03:09:52 UTC

Apache Spark : Custom function for reduceByKey - missing arguments for method

I am trying to normalize a dataset (convert the values of all attributes in the
vector to the "0-1" range). I created an RDD of (attrib-name, attrib-value)
tuples for all the records in the dataset as follows:

val attribMap: RDD[(String, DoubleDimension)] = contactDataset.flatMap {
  contact =>
    List(
      ("dage", contact.dage match { case Some(value) => DoubleDimension(value); case None => null }),
      ("dancstry1", contact.dancstry1 match { case Some(value) => DoubleDimension(value); case None => null }),
      ("dancstry2", contact.dancstry2 match { case Some(value) => DoubleDimension(value); case None => null }),
      ("ddepart", contact.ddepart match { case Some(value) => DoubleDimension(value); case None => null }),
      ("dhispanic", contact.dhispanic match { case Some(value) => DoubleDimension(value); case None => null }),
      ("dhour89", contact.dhour89 match { case Some(value) => DoubleDimension(value); case None => null })
    )
}

Here, contactDataset is of type RDD[Contact]. The fields of the Contact
class are of type Option[Long].

DoubleDimension is a simple wrapper over the Double datatype. It extends the
Ordered trait and implements the corresponding compare and equals methods.
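
For reference, a minimal sketch of such a wrapper (a simplified stand-in, not
my exact class) could look like:

case class DoubleDimension(value: Double) extends Ordered[DoubleDimension] {
  // Ordered derives <, >, <=, >= from compare;
  // the case class gives a structural equals for free
  def compare(that: DoubleDimension): Int = this.value.compare(that.value)
}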

To obtain the max and min attribute vectors needed to compute the normalized
values (for each attribute, (x - min) / (max - min)):

maxVector = attribMap.reduceByKey( getMax )
minVector = attribMap.reduceByKey( getMin )
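
A rough sketch of the follow-up step I have in mind (hypothetical, not written
yet; it assumes both RDDs share the same attribute keys):

val bounds = minVector.join(maxVector)  // RDD[(String, (DoubleDimension, DoubleDimension))]
val normalized = attribMap.join(bounds).mapValues { case (x, (min, max)) =>
  (x.value - min.value) / (max.value - min.value)  // min-max scaling to 0-1
}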

The implementations of getMax and getMin are as follows:

def getMax(a: DoubleDimension, b: DoubleDimension): DoubleDimension = {
  if (a > b) a
  else b
}

def getMin(a: DoubleDimension, b: DoubleDimension): DoubleDimension = {
  if (a < b) a
  else b
}

I get compile errors at the calls to getMax and getMin, stating:

[ERROR] .../com/ameyamm/input_generator/DatasetReader.scala:117: error: missing arguments for method getMax in class DatasetReader;
[ERROR] follow this method with '_' if you want to treat it as a partially applied function
[ERROR] maxVector = attribMap.reduceByKey( getMax )

[ERROR] .../com/ameyamm/input_generator/DatasetReader.scala:118: error: missing arguments for method getMin in class DatasetReader;
[ERROR] follow this method with '_' if you want to treat it as a partially applied function
[ERROR] minVector = attribMap.reduceByKey( getMin )

I am not sure what I am doing wrong here. My RDD is an RDD of pairs, and as
far as I know I can pass any method to reduceByKey as long as the function is
of type f : (V, V) => V.

I am really stuck here. Please help.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Custom-function-for-reduceByKey-missing-arguments-for-method-tp23756.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Apache Spark : Custom function for reduceByKey - missing arguments for method

Posted by Richard Marscher <rm...@localytics.com>.
Did you try adding the `_` after the method names to partially apply them?
Scala is saying that it is trying to apply those methods immediately but can't
find arguments, whereas you want to pass them along as functions (which they
aren't; they are methods). Here is a link to a Stack Overflow answer that
should help clarify: http://stackoverflow.com/a/19720808/72401. I think there
are two solutions: either turn getMax and getMin into function values by using
val, like so:

val getMax: (DoubleDimension, DoubleDimension) => DoubleDimension = { (a, b) =>
  if (a > b) a
  else b
}

val getMin: (DoubleDimension, DoubleDimension) => DoubleDimension = { (a, b) =>
  if (a < b) a
  else b
}

or just partially apply them:

maxVector = attribMap.reduceByKey(getMax _)
minVector = attribMap.reduceByKey(getMin _)
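
To make the distinction concrete, here is roughly what the eta-expansion looks
like in the REPL (using Int instead of DoubleDimension purely for
illustration):

scala> def getMax(a: Int, b: Int): Int = if (a > b) a else b
getMax: (a: Int, b: Int)Int

scala> val f = getMax _   // trailing _ eta-expands the method into a function value
f: (Int, Int) => Int = <function2>

scala> f(3, 5)
res0: Int = 5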

On Thu, Jul 9, 2015 at 9:09 PM, ameyamm <am...@outlook.com> wrote:

> [original message snipped]


-- 
Richard Marscher
Software Engineer
Localytics